ABSTRACT


Since 1951, when Robbins and Monro's pioneering paper on
stochastic approximation was published, many articles have
appeared dealing with extensions, modifications, methods
and applications of stochastic approximation. While the
concepts involved are relatively simple, they are mathematically
intricate, and information concerning specific results has
been widely scattered and difficult to collect for the
interested researcher. This paper will attempt to discuss
the major results and will provide the necessary references
to direct the user to more specific findings.


I. INTRODUCTION


In many areas of analysis in bioassay, sensitivity
testing, or learning we are concerned with a level of output,
Y, for a certain level of some input, x. For each given
value of x, the resultant output is not deterministic but
has some underlying probability distribution, F(Y|x).
Hence it is common to refer to the response function
of x, denoted M(x), as the expected value of Y given x.
(I.e., $M(x) = \int_{-\infty}^{\infty} Y \, dF(Y|x) = E[Y|x]$.)

In the usual regression analysis of a response function, M(x), it
is assumed that the function is of known form with unknown
parameters, say:

$$M(x) = \beta_0 + \beta_1 x + \beta_2 x^2 .$$

Here the parameters, $\beta_i$, are estimated on the basis of
observations $Y_1, Y_2, \ldots, Y_n$ corresponding to observed
values $x_1, x_2, \ldots, x_n$. The method of least squares, for
example, yields the estimators of $\beta_i$ which minimize the
sum of the squared errors.
However cases often arise in which one has little
prior knowledge of the actual form of $M(\cdot)$ or one is only
interested in trying to estimate the value $\theta$ such that

$$M(\theta) = \alpha ,$$

where $\alpha$ is a specific desired response level. We desire
to find a sampling scheme such that $x_n \to \theta$.


Robbins and Monro [Ref. 88] presented the following:


THEOREM: Let M(x) be a given function and $\alpha$ a given constant
such that the equation $M(x) = \alpha$ has a uniquely defined root*
$\theta$. Let Y(x) denote a realization of an experiment at the
level x. Assume Y(x) has distribution

$$P[Y(x) \le y] = H(y|x) \quad \text{such that} \quad M(x) = \int_{-\infty}^{\infty} y \, dH(y|x).$$

(I.e., $M(x) = E(Y|x)$.) Choose $x_1$ arbitrary and define the
recursive relation:

$$x_{n+1} = x_n + a_n (\alpha - Y(x_n)).$$

If there exists a positive constant C such that

$$P[|Y(x)| \le C] = 1$$

and if, for some constant $\delta > 0$,

$$M(x) \le \alpha - \delta \quad \text{for } x < \theta$$

and

$$M(x) \ge \alpha + \delta \quad \text{for } x > \theta ,$$

then for $a_n = 1/n$

$$\lim_{n \to \infty} E[(x_n - \theta)^2] = 0 .$$

* Note that this requires that for some $\delta > 0$,
$M(x) \le \alpha - \delta$ for $x < \theta$ and $M(x) \ge \alpha + \delta$ for $x > \theta$,
but does not specifically require that $M(\theta) = \alpha$.
The procedure of recursively defining $x_{n+1}$ as a function
of $x_n$ by

$$x_{n+1} = x_n + a_n (\alpha - Y(x_n))$$

is referred to as the Robbins-Monro method or procedure.
(Note that the process is a first order Markov process,
although it is in general non-homogeneous.) Papers which
followed Robbins and Monro's discussed topics such as
convergence, finding the maximum (or minimum) of a function,
multidimensional applications, and accelerated processes,
to name a few.
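
The recursion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response function $M(x) = 2x + 1$, the uniform noise, the level $\alpha = 5$ and the function names `robbins_monro` and `noisy_response` are all hypothetical and do not come from the original paper. The linear response is not bounded, so the run illustrates the iteration itself rather than the exact hypotheses of the theorem.

```python
import random

def robbins_monro(observe, alpha, x1, n_steps, c=1.0):
    """Iterate x_{n+1} = x_n + a_n * (alpha - Y(x_n)) with a_n = c / n."""
    x = x1
    for n in range(1, n_steps + 1):
        a_n = c / n                  # step sizes of the type 1/n
        y = observe(x)               # one noisy realization Y(x_n)
        x = x + a_n * (alpha - y)    # move toward the desired level alpha
    return x

# Hypothetical experiment: M(x) = 2x + 1 observed with uniform noise on [-1, 1].
# The root of M(x) = 5 is theta = 2.
rng = random.Random(0)

def noisy_response(x):
    return 2.0 * x + 1.0 + rng.uniform(-1.0, 1.0)

estimate = robbins_monro(noisy_response, alpha=5.0, x1=0.0, n_steps=5000)
print(estimate)  # should land near theta = 2
```

Only single noisy observations are used at each step; no estimate of the form of M(x) is ever constructed.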

In the first few years of stochastic approximation
survey papers by Derman (1956) [Ref. 25], Schmetterer (1960)
[Ref. 101], and Loginov (1966) [Ref. 81] presented major
developments up to their respective dates of publication. A
text on the subject was attempted by Wasan (1969) [Ref. 129]
but received strong criticism because of serious oversights
and many misprints.^1 While the aforementioned publications
contained only the mathematical formulation, other treatments
by Fu [Refs. 53, 54] and Wetherill [Ref. 130] contained
predominantly practical and intuitive information and little
mathematical background. This paper will attempt to present
the major results of both mathematical formulation and
practical applications and to discuss the intuitive meaning
where it is applicable. The list of references is intended
to be as complete as possible on the subject. Consequently
many of the bibliographical entries are not specifically
referenced.

^1 Dupac, V., Book Review, Annals of Mathematical
Statistics, v. 41, p. 1131, 1970.





II. MOTIVATING STOCHASTIC APPROXIMATION


In many experimental applications, as in bioassay, sensitivity
testing, or fatigue trials, the statistician is often
interested in estimating a given quantile of a distribution
function or a level of response. Situations of this type
are candidates for solution by stochastic approximation
methods. Examples of these situations are:


A. QUANTILE ESTIMATION 

Suppose we are testing the resistance of a metallic
component to fatigue fracture. Let F(x) denote the probability
that a specimen will fail if subjected to x cycles
in a trial. Then a specimen, when tested in such a way,
represents an observation which takes on a value one or
zero depending on whether or not it fractures in x cycles.
Using the notation of the previous section, Y(x) = 1 if
the specimen fractures and Y(x) = 0 otherwise, so that
M(x) = F(x). It is of interest to estimate the number of
cycles, x, such that for a given $\alpha$, $F(x) = M(x) = \alpha$.

LD50

We wish to administer sample doses of a drug to
laboratory animals, say rats, such that we determine the
dosage such that 50% of them die on the average. In this
case $\alpha = .5$ in our problem formulation and we desire to
solve M(x) = .5 for x.
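
A minimal sketch of how such a quantile search might look with zero-one responses is given below. The logistic tolerance curve with median 3, the constant `c = 8.0` and the names `estimate_quantile` and `dose_response` are hypothetical choices made only for illustration; in practice F(x) is unknown and only the binary outcomes are observed.

```python
import math
import random

def estimate_quantile(respond, alpha, x1, n_steps, c):
    """Robbins-Monro search for the level x at which P(response) = alpha."""
    x = x1
    for n in range(1, n_steps + 1):
        y = respond(x)                 # 1 if the subject responds, else 0
        x += (c / n) * (alpha - y)     # raise the dose after 0, lower after 1
    return x

rng = random.Random(1)

def dose_response(x):
    # Hypothetical logistic tolerance curve F(x) with median (LD50) at x = 3.
    p = 1.0 / (1.0 + math.exp(-(x - 3.0)))
    return 1 if rng.random() < p else 0

ld50 = estimate_quantile(dose_response, alpha=0.5, x1=0.0, n_steps=20000, c=8.0)
print(ld50)  # should land near the assumed median of 3
```

The constant c is taken fairly large here because the assumed curve has slope 1/4 at its median, so the product of c and the slope stays above 1/2.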







B. LEVEL DETERMINATION 

Suppose in a production process it is desired to find the level
of some material such that a characteristic, say the
density of the finished product, is at a pre-determined
level. However each batch is subject to impurities and
reacts as a stochastic realization. A stochastic approximation
scheme may be devised to automatically set and
adjust the desired input flow to produce the desired results.


C. ROUND-OFF ERRORS 

As stated by Schmetterer [Ref. 101] we can consider
the application of the RM process to the problem of round-off
errors. This problem occurs, for example, if one solves
equations by a classical iteration process using electronic
computers. Define for every real number x a random variable,
Y(x), in the following way.*

$$P[Y(x) = [x] + 1] = x - [x],$$

$$P[Y(x) = [x]] = 1 - (x - [x]).$$

Note that E[Y(x)] = x. From here we can deduce as a
consequence of more general theorems the following result. If
one solves a linear equation by an iterative procedure and
modifies it by using for every step of the iteration the
rounding rule given above, then the modified procedure
converges with probability one to a solution of the given
equation.

* Note that [x] denotes the largest integer contained in
x. For example [2.87] = 2.
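
The property E[Y(x)] = x of a randomized rounding rule can be checked numerically. The sketch below assumes the rule "round up with probability equal to the fractional part, otherwise round down"; the function name `random_round` and the sample size are illustrative only.

```python
import math
import random

def random_round(x, rng):
    """Return [x] + 1 with probability x - [x], and [x] otherwise."""
    floor_x = math.floor(x)            # [x], the largest integer contained in x
    frac = x - floor_x                 # fractional part of x
    return floor_x + 1 if rng.random() < frac else floor_x

rng = random.Random(2)
samples = [random_round(2.87, rng) for _ in range(100000)]
mean = sum(samples) / len(samples)
print(mean)  # the sample mean should be close to 2.87
```

Every individual value is an integer, yet the rounding is unbiased, which is what lets the rounding error be treated as the zero-mean noise of a stochastic approximation scheme.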







III. THE ROBBINS-MONRO METHOD


In the introduction the first of two theorems from the
original Robbins-Monro paper was presented. This first
theorem required that the response, Y(x), be
bounded, allowed discontinuities in the function M(x) = E[Y(x)],
and did not specifically require that $M(\theta) = \alpha$, the
desired response level. The second theorem of Robbins-Monro
is presented below:

THEOREM [Ref. 88]
Let the sequence $\{a_n\}$ be of the form 1/n and assume there
exists some constant C > 0 such that

$$P[|Y(x)| \le C] = 1 ,$$

and that the conditions
(i.) M(x) is a nondecreasing function,
(ii.) $M(\theta) = \alpha$,
(iii.) $M'(\theta) > 0$
are satisfied. Then defining the recursive relation

$$x_{n+1} = x_n + a_n [\alpha - Y(x_n)]$$

implies the result

$$\lim_{n \to \infty} E[(x_n - \theta)^2] = 0 .$$

A. WEAKENING THE CONDITIONS FOR CONVERGENCE

Wolfowitz [Ref. 132], in response to questions of
Robbins and Monro, showed that if the
response function, M(x), satisfies

$$|M(x)| \le C < \infty ,$$

$$\int_{-\infty}^{\infty} (Y - M(x))^2 \, dF(Y|x) \le \sigma^2 < +\infty ,$$

along with

$M(x) < \alpha$ for $x < \theta$,

$M(\theta) = \alpha$,

$M(x) > \alpha$ for $x > \theta$,

M(x) strictly increasing when $|x - \theta| < \delta$ for some $\delta > 0$
and

$$\inf_{|x-\theta| \ge \delta} |M(x) - \alpha| > 0 ,$$

then for $x_n$ defined as in the RM process, $x_n$ converges in
probability to $\theta$.







B. CONVERGENCE WITH PROBABILITY 1

Kallianpur [Ref. 68] and Blum [Ref. 6] both proved
convergence with probability 1, a mode of convergence which,
like convergence in mean square, implies convergence in
probability. Blum [Ref. 6] proved that the RM
process converges with probability 1 under conditions even
weaker than those of Wolfowitz [Ref. 132]. While Wolfowitz
required that the regression function $M(\cdot)$ be bounded,
Blum only required that it lie between two lines.


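BLUM'S THEOREM [Ref. 6]
Let M(x) be the regression function. We assume that
M(x) is measurable and satisfies the following conditions:

$$|M(x)| \le C + d|x|$$

for some C, d > 0,

$$\int_{-\infty}^{\infty} (Y - M(x))^2 \, dF(Y|x) \le \sigma^2 < +\infty ,$$

$M(x) < \alpha$ for $x < \theta$
and
$M(x) > \alpha$ for $x > \theta$,

$$\inf_{\delta_1 \le |x-\theta| \le \delta_2} |M(x) - \alpha| > 0$$

for any pair $(\delta_1, \delta_2)$ with $0 < \delta_1 < \delta_2 < \infty$. If moreover

$$\sum_{n=1}^{\infty} a_n = \infty$$

and

$$\sum_{n=1}^{\infty} a_n^2 < \infty ,$$

then $x_n$ converges to $\theta$ with probability 1.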


C. A FURTHER WEAKENING OF CONDITIONS FOR CONVERGENCE 

In 1963 Friedman [Ref. 51] further weakened the requirements
for convergence with probability 1 by removing the
necessity for M(x) and $\sigma^2(x)$ to be bounded by a linear
function and a constant respectively. Friedman's theorem
is presented here:


THEOREM [Ref. 51]
Let f(x) be a function which is positive and bounded
in any finite interval. Let the following conditions be
satisfied:

$$|M(x)| \le (L|x| + K) f(x)$$

with constants L, K, $\sigma^2$ > 0,

$$\sigma^2(x) \le \sigma^2 f^2(x),$$

$M(x) < \alpha$ for $x < \theta$,
and
$M(x) > \alpha$ for $x > \theta$,

$$\inf_{\delta_1 \le |x-\theta| \le \delta_2} |M(x) - \alpha| > 0$$

for any pair $(\delta_1, \delta_2)$. Then the sequence defined by

$$x_{n+1} = x_n + \frac{a_n}{f(x_n)} (\alpha - Y(x_n))$$

converges to $\theta$ with probability 1.

This theorem of Friedman enables one to construct a
convergent process when |M(x)| and $\sigma(x)$ are bounded by
known functions $f_1(x)$ and $f_2(x)$. One then takes
$f(x) = \max(f_1(x), f_2(x))$. This procedure is also applicable
where f(x) is decreasing to zero for large values of x.
However the convergence is relatively slow.







Later Gladyshev [Ref. 57] simplified the conditions
for convergence with probability 1 with the following:


THEOREM

Let M(x) be a measurable function such that

$$\inf_{\varepsilon < |x-\theta| < 1/\varepsilon} (x - \theta)\{M(x) - \alpha\} > 0 \quad \text{for all } \varepsilon > 0 .$$

Moreover assume there exists a positive number d such that
for all x we have

$$E[Y^2(x)] \le d(1 + x^2) ,$$

and if $\{a_n\}$ satisfies the conditions previously stated then
$x_n$ converges to $\theta$ with probability 1 where $x_n$ is defined
as in the original RM process.


D. THE MULTIDIMENSIONAL CASE

Blum [Ref. 7] was the first to generalize the Robbins-Monro
process to the multidimensional case. He considered
the following problem:

Assume that we are given a family of N random variables

$$Y_1(x_1, \ldots, x_N), \ldots, Y_N(x_1, \ldots, x_N)$$

with distribution functions

$$F_i(Y_i | x_1, \ldots, x_N), \quad i = 1, \ldots, N ;$$

moreover assume that

$$M_i(x_1, \ldots, x_N) = \int Y_i \, dF_i(Y_i | x_1, \ldots, x_N)$$

(i.e. $M_i$ is the corresponding regression function in the
i-th dimension).
It is desired to construct a sequence whose limit is
the root vector of the system of equations

$$M_i(x_1, \ldots, x_N) = \alpha_i , \quad i = 1, \ldots, N .$$

For simplicity it is assumed that all $\alpha_i = 0$ and that
$M_i(0) = 0$.

Let f(x) be a real function that is defined on real
N-dimensional space and has continuous first and second
derivatives. Let $A(x) = (\partial^2 f / \partial x_i \partial x_j)$ denote the matrix of
second derivatives and let $D(x) = (\partial f / \partial x_i)$ denote the vector
of the first derivatives.

In matrix form the Robbins-Monro process is of the form
$x_{n+1} = x_n + a_n Y_n$ where it is assumed as before that

$$\sum_{n=1}^{\infty} a_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} a_n^2 < \infty .$$

Observe the following notation:

$U(x) = \langle D(x), M(x) \rangle$ (i.e. the scalar product of D(x)
and M(x)).

Let $V(x) = E\{\langle Y, A(x + q a_n Y) Y \rangle\}$ where $0 \le q \le 1$.


THEOREM [Ref. 7]
If there exists a real function f(x) with continuous
derivatives that satisfies the conditions

(i) $f(x) \ge 0$,

(ii) $\sup_{\|x\| \ge \varepsilon} U(x) < 0$ for $\varepsilon > 0$,

(iii) $\inf_{\|x\| \ge \varepsilon} f(x) > 0$ for $\varepsilon > 0$,

(iv) $V(x) \le V < \infty$ for all x,

then the sequence $\{x_n\}$ as previously defined converges
almost surely to zero.
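
A small two-dimensional sketch of the vector recursion $x_{n+1} = x_n + a_n Y_n$ follows. The regression function $M(x) = -Bx/(1 + \|x\|)$, the matrix `B`, the Gaussian noise and the names `robbins_monro_nd` and `noisy_field` are hypothetical; they are chosen only so that the field points toward its single root at the origin and stays bounded, in the spirit of the conditions above with $f(x) = \|x\|^2$.

```python
import random

def robbins_monro_nd(observe, x1, n_steps):
    """Vector recursion x_{n+1} = x_n + a_n * Y_n with a_n = 1 / n."""
    x = list(x1)
    for n in range(1, n_steps + 1):
        y = observe(x)                                  # noisy vector observation
        x = [xi + yi / n for xi, yi in zip(x, y)]
    return x

rng = random.Random(3)
B = [[2.0, 1.0], [1.0, 3.0]]     # hypothetical symmetric positive definite matrix

def noisy_field(x):
    # Assumed regression function M(x) = -B x / (1 + ||x||); its only root is 0.
    norm = sum(v * v for v in x) ** 0.5
    m = [-sum(B[i][j] * x[j] for j in range(2)) / (1.0 + norm) for i in range(2)]
    return [mi + rng.gauss(0.0, 1.0) for mi in m]

x_final = robbins_monro_nd(noisy_field, x1=[5.0, -3.0], n_steps=5000)
distance = sum(v * v for v in x_final) ** 0.5
print(distance)  # should be small compared with the starting distance
```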

It should also be noted that the multidimensional case
is a direct extension of theorems by Derman and Sacks [Ref. 26]
and by Gladyshev [Ref. 57] where we think of $x_n$, $Y_n$
and M(x) as N-dimensional vectors and treat multiplication
as the vector scalar product. Of interest is a special case
where the regression function, M(x), is linear with a
coefficient matrix M that is a symmetric matrix. The following modified RM
process was proposed by Dupac [Ref. 33] for this special
case.

meme. 415 a random vector whose distribution function 


is ery ~ a) where Y, is a random vector with distribution 




function BC Ys): Then the following theorem is presented 


by Dupac.


THEOREM [Ref. 33] 


Assume that 


TAN = M(x) || aF (Y |x) z C 


n 


and 


are satisfied, where the $\lambda_i$ are the characteristic roots
of the matrix M. Then if a > 1/2 then the sequence


1? 
defined by 


tl n 


converges to $\theta$ with probability 1 and
2 
EI O A) 


There are certain situations where the multidimensional 
case can be reduced to a one dimensional case. The results 
are by Eppling [Ref. 37] and require general stochastic 


approximation theorems of the Dvoretzky type. 







E. DVORETZKY'S GENERALIZED PROCESS

Dvoretzky [Ref. 36] has suggested that any stochastic
approximation procedure may be viewed as an ordinary
deterministic (error free) successive approximation method
with a noise component superimposed on it at each step.
On the basis of this concept a very generalized class of
stochastic approximation theorems can be studied.

Assume that $T_n(x_1, \ldots, x_n)$ is a Borel-measurable sequence
of transformations from n-dimensional Euclidean space, $R_n$,
into $R_1$. One may then construct the sequence from the
relation

$$x_{n+1} = T_n(x_1, \ldots, x_n) + Z_n ,$$

where $T_n(x_1, \ldots, x_n)$ is an error-free transformation and
$Z_n$ is the error. Dvoretzky then proved the following theorem:


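THEOREM [Ref. 36]
Let $\{\alpha_n\}$, $\{\beta_n\}$ and $\{\gamma_n\}$ be sequences of non-negative
real numbers, such that

$$\lim_{n \to \infty} \alpha_n = 0 , \quad \sum_{n=1}^{\infty} \beta_n < \infty , \quad \sum_{n=1}^{\infty} \gamma_n = \infty .$$

Moreover, assume that the condition

$$|T_n(r_1, \ldots, r_n) - \theta| \le \max[\alpha_n, (1 + \beta_n)|r_n - \theta| - \gamma_n]$$

is satisfied for all real $r_1, \ldots, r_n$ and also that

$$\sum_{n=1}^{\infty} E[Z_n^2] < \infty$$

and

$$E(Z_n | x_1, \ldots, x_n) = 0$$

with probability one.* Then the sequence $\{x_n\}$ defined by

$$x_{n+1} = T_n(x_1, \ldots, x_n) + Z_n$$

converges to the desired quantity, $\theta$, in mean square and
with probability 1. I.e.,

$$\lim_{n \to \infty} E[(x_n - \theta)^2] = 0 ,$$

and

$$P[\lim_{n \to \infty} x_n = \theta] = 1 .$$

* Note that this condition is satisfied if the $Z_n$
are a sequence of independent errors for which $E(Z_n) = 0$
for all n.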







It can easily be shown that the Robbins-Monro procedure
is a special case of Dvoretzky's generalized procedure. To
do this write the normal R-M relation

$$x_{n+1} = x_n + a_n (\alpha - Y(x_n))$$

as

$$x_{n+1} = x_n + a_n [\alpha - M(x_n)] + a_n [M(x_n) - Y(x_n)] .$$

Then letting

$$T_n(x_1, \ldots, x_n) = x_n + a_n [\alpha - M(x_n)]$$

and

$$Z_n = a_n [M(x_n) - Y(x_n)]$$

we have the RM procedure in Dvoretzky's format. In a
similar manner the Kiefer-Wolfowitz procedure, which will
be discussed later, can be written as a special case of
Dvoretzky's theorem.
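
The split of one Robbins-Monro step into an error-free part and a noise part can be verified on numbers. In the sketch below the regression function `M`, the numerical values and the function names `rm_step` and `dvoretzky_split` are hypothetical and serve only to show that the two parts add up to the ordinary step.

```python
def rm_step(x, a_n, alpha, y):
    """Ordinary Robbins-Monro step x + a_n * (alpha - y)."""
    return x + a_n * (alpha - y)

def dvoretzky_split(x, a_n, alpha, y, M):
    """Return (T_n, Z_n) for one step, given the regression function M."""
    t_n = x + a_n * (alpha - M(x))    # error-free transformation T_n
    z_n = a_n * (M(x) - y)            # noise component Z_n, zero mean given x
    return t_n, z_n

M = lambda x: 2.0 * x + 1.0           # hypothetical regression function
x, a_n, alpha, y = 0.7, 0.25, 5.0, 3.1
t_n, z_n = dvoretzky_split(x, a_n, alpha, y, M)
print(t_n + z_n, rm_step(x, a_n, alpha, y))  # the two values agree
```

In practice M is unknown, so the split is an analytical device for proofs rather than something the experimenter computes.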
Dvoretzky extended his generalized procedure even
further by replacing the sequences $\alpha_n$, $\beta_n$ and $\gamma_n$ by
non-negative functions $\alpha_n(r_1, \ldots, r_n)$, $\beta_n(r_1, \ldots, r_n)$ and
$\gamma_n(r_1, \ldots, r_n)$, respectively, provided that they satisfy the
following conditions:

(i) The functions $\alpha_n(r_1, \ldots, r_n)$ are
uniformly bounded and $\lim_{n \to \infty} \alpha_n(r_1, \ldots, r_n) = 0$
uniformly for all sequences $r_1, \ldots, r_n, \ldots$;

(ii) the functions $\beta_n(r_1, \ldots, r_n)$ are
measurable and $\sum_{n=1}^{\infty} \beta_n(r_1, \ldots, r_n)$ is
uniformly bounded and uniformly convergent
for all sequences $r_1, \ldots, r_n, \ldots$;

(iii) the functions $\gamma_n(r_1, \ldots, r_n)$ satisfy
$\sum_{n=1}^{\infty} \gamma_n(r_1, \ldots, r_n) = \infty$ uniformly for
all sequences $r_1, \ldots, r_n, \ldots$ for which
$\sup_n |r_n| < L$, where L is a finite
number.


The introduction of Dvoretzky's general conditions 
allowed regression functions of the form M(x) = -xf or 
M(x) = Exp (-x?) to be applicable to stochastic approxima- 
tion type theorems. The most comprehensive presentation 
of Dvoretzky stochastic approximation theorems has been by
Venter [Ref. 120] in 1966. Venter's theorems generalized
the work of Dvoretzky [Ref. 36] and Wolfowitz [Ref. 133] for
transforms on the real line, of Derman and Sacks [Ref. 26]
on finite dimensional Euclidean spaces, and of Schmetterer
[Ref. 101] for Hilbert spaces.

Block [Ref. 5] had proposed a more general type of
stochastic approximation taking place on a normed vector
space.


F. EXPECTED SQUARED ERROR

While Blum [Ref. 6], Dvoretzky [Ref. 36], and Dupac
[Ref. 32] were establishing conditions under which
$b_n = E[(x_n - \theta)^2] \to 0$, others such as Chung [Ref. 14], Hodges
and Lehmann [Ref. 64], Kallianpur [Ref. 68], and Schmetterer
[Ref. 101] were trying to establish bounds on $b_n$. (Note
that $b_n$ = variance + (bias)$^2$.) Below is Schmetterer's
result for the bounded case.


THEOREM [Ref. 101]
Let M(x) be a Borel-measurable function that satisfies
(i) $P\{|Y(x)| \le C\} = 1$ for some constant $|C| < +\infty$
and

(ii) $(x-\theta)\{M(x)-\alpha\} > 0$ for $x \ne \theta$.

Also there exists an $\varepsilon > 0$, and positive constants $C_1$, $C_2$,
such that

(iii) $|M(x)-\alpha| \ge C_1 |x-\theta|$ for $|x-\theta| \le \varepsilon$,

(iv) $|M(x)-\alpha| \ge C_2$ for $|x-\theta| > \varepsilon$.





Then 





DS dy I Me AL a a "e [L T (1-72 D- 
i=] i-l il fem i=l r=] oma 
> n 
Mere e, =E (y,-a) » gy is defined as 1, A, = Pr : 


where exists a constant, Cas such that 


c 
[M(x)=a] > 5 lx „-0] with probability i. 
= 


An 

As was noted above this theorem holds for the so called
"bounded case." This same result holds for the quasilinear
case if conditions (i), (iii), and (iv) are replaced by the
following conditions:

There exists a $C_4$ such that

$$E\{[Y(x)-M(x)]^2\} \le C_4 ,$$

and the quasilinear conditions that there exist positive
constants $C_5$ and $C_6$ such that

$$C_5 |x_n - \theta| \ge |M(x_n) - \alpha| \ge C_6 |x_n - \theta| .$$


Then the above estimate of BD holds when a, morsubpstituctced 
for as Aa and 206 subasta tuted for C3. 
Therefore if $a_n = a/n$ where $a > 1/(2C_6)$ then $b_n = O(1/n)$.
This latter is the most frequently used result.







G. EXPECTED SQUARED ERROR IN THE LINEAR CASE

Hodges and Lehmann [Ref. 64] analyzed in detail the case
where it is desired to estimate the value of x for which
M(x) = 0 where it is assumed that $M(x) = \beta x$, and variance (Y(x))
$= \sigma^2$. Then using the RM process to define $\{x_n\}$ yields

$$b_n = E[x_n^2] = x_1^2 \prod_{r=1}^{n-1} (1 - \beta a_r)^2 + \sigma^2 \sum_{r=1}^{n-1} a_r^2 \prod_{s=r+1}^{n-1} (1 - \beta a_s)^2 .$$

In analyzing this expression it becomes obvious that the
first term represents the expected bias based on the initial
value, $x_1$, while the second term represents the variance
component of the error. Since Chung [Ref. 14]
established that under certain conditions the sequence
$a_n = c/n$ gives most rapid convergence of $x_n$ to $\theta$, it is of
interest to analyze the expression for expected squared
error for this family of coefficients.

For the first (expected bias due to initial value, $x_1$) term
the expected bias = $O(n^{-c\beta})$ if $(c/n)^{-1} > \beta$ for all n,
but becomes quite large if c is chosen too small. Thus, as noted
by Wetherill [Ref. 130], it would be more desirable to tend
to overestimate $c = \beta^{-1}$ rather than underestimate.

The analysis of the second term is more complicated but
Hodges and Lehmann have shown that it is asymptotically
equivalent to

$$\frac{\sigma^2 c^2}{n(2c\beta - 1)} \quad \text{if } c > 1/2\beta ,$$

$$\frac{\sigma^2 \log n}{4 n \beta^2} \quad \text{if } c = 1/2\beta ,$$

and we note that c should not be less than $(2\beta)^{-1}$ because
of the large bias which results.

These results give us conditions on the sequence,
$a_n = c/n$, and on the regression function,
so that the bias, resulting from an initial bad guess,
tends to zero with increasing sample size and the
expected squared error of $x_n$ is of the order O(1/n). The
shortcoming of the linear model is that we do not know
how nearly linear M(x) must be nor how nearly constant
variance (Y(x)) must be in order that the linear approximation
will represent what actually happens. The only evidence on
this point consists of a sampling experiment by Teichroew
[Ref. 109]. There it was found that the linear theory is in
reasonable agreement with the data.
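
For the linear model the expected squared error can be computed exactly, without simulation, from the one-step relation $b_{n+1} = (1 - \beta a_n)^2 b_n + a_n^2 \sigma^2$, which follows from $x_{n+1} = (1 - \beta a_n) x_n - a_n e_n$ with independent zero-mean errors $e_n$ of variance $\sigma^2$. The sketch below is an illustration under those assumptions; the function name `expected_squared_error` and the parameter values are hypothetical.

```python
def expected_squared_error(beta, sigma, c, x1, n_steps):
    """Exact b_{n_steps + 1} = E[x^2] for M(x) = beta * x and a_n = c / n."""
    b = x1 ** 2                        # b_1, with a fixed starting value x_1
    for n in range(1, n_steps + 1):
        a_n = c / n
        b = (1.0 - beta * a_n) ** 2 * b + (a_n * sigma) ** 2
    return b

n = 10000
# c above 1/(2*beta): n * b_n approaches sigma^2 c^2 / (2 c beta - 1) = 1/3 here.
b_fast = expected_squared_error(beta=2.0, sigma=1.0, c=1.0, x1=5.0, n_steps=n)
# c below 1/(2*beta): the initial bias dies out slowly and dominates the error.
b_slow = expected_squared_error(beta=1.0, sigma=1.0, c=0.25, x1=1.0, n_steps=n)
b_ref = expected_squared_error(beta=1.0, sigma=1.0, c=1.0, x1=1.0, n_steps=n)
print((n + 1) * b_fast, b_slow, b_ref)
```

Comparing `b_slow` with `b_ref` shows numerically why underestimating c is the more costly mistake in this model.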


H. RATE OF CONVERGENCE

In a recent paper Komlos and Revesz [Ref. 136] presented
estimates of the rate of convergence of the R-M process in a
manner more concise than any previous result. They considered
the case $\alpha = 0$, $\theta = 0$, $a_n = 1/n$ and presented the following
estimates.







For the case where there exists L > 0 such that 


PLIY(x)-M(x)| < LJ=1l and Lim M(x)=M(») > L, 


X >00 
if the conditions 
M(x) < C} X + ds if x> 9= 0 
ME e Ed aK EOT, 


2 2 


m@emoacvisitied for positive constants Ci» Co» d,s da, 


Bnen Dee £ aor 


for any € > 0, where y = y(e) > 0. 
For the case where M(~) < L the rate of convergence 


Ch Slower than for the previous case. Specifically 


M (0) 
- 6 
z 





BEE S e 


any 6 > 0, n > ny (6). 


I. ASYMPTOTIC NORMALITY

The asymptotic behavior of the higher-order moments and
the asymptotic distribution of the random variables defined
by the sequence $\{x_n\}$ were first considered in detail by
Chung [Ref. 14]. His method is based on the moments of
$(x_n - \theta)$ and his results have been widely used in papers on
stochastic approximation. Chung's fundamental result is for
the "bounded case" and can be stated as follows:







THEOREM [Ref. 14] 
Let M(x) be a Borel-measurable function that satisfies 


me following conditions 


P{|Y(x) | < Cy} = 1 for some constant, SE 
(x-B){M(x)-a} > 0 , 


M(x) = a + a, (x-8) SN 5 


inf IM(x)-a] = K¿(8) 0 

|x-0|>8 
and 

2 2 

ENS E) o E = for all x 

Meee = 1/n ; 
n 
where 1 < e < l 
2(1+C,) 2 


and where C5 is determined by the condition 


IM(x, )-al ze en |x, -6] 


2 


meme tor any integer r > 1 


r 0 if r is odd, 
Lim nO E)5 E[(x,-0)%7 = > r 
n>o | . Ko /2a, )2 (r-1) if r is even, 


and the random variable EI (y o) is asymptotically 
normally distributed with mean zero and variance = 07/20, . 

A similar result was obtained by Chung for the quasi-linear
case (i.e. M(x) lies between two straight lines with
nonvanishing slope). In this case the boundedness assumption
is replaced by

$$K|x-\theta| \le |M(x)-\alpha| \le K_1 |x-\theta| \quad \text{for } K > 0, \ K_1 < \infty$$

and

$$E[\{Y(x)-M(x)\}^p] < \infty \quad \text{for p an even integer.}$$

Then using $a_n = c/n$ where c > 1/2K, the distribution of
$n^{1/2}(x_n - \theta)$ tends to normal with mean zero and variance =

$$\sigma^2 c^2 / (2\alpha_1 c - 1) .$$


For the special case where M(x) is linear Chung proves
asymptotic normality by using characteristic functions in a
very concise proof. While Hodges and Lehmann [Ref. 64]
improved some of Chung's results, Sacks [Ref. 95] utilized
a central limit theorem for dependent random variables to
give more general and more complete results about the
asymptotic normality of $x_n$. Below is a theorem of
Gladyshev [Ref. 57] that is a strengthened form of Sacks'
fundamental result.


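THEOREM [Ref. 57]

Assume that the following conditions are satisfied:

$$\inf_{\varepsilon < |x-\theta| < 1/\varepsilon} (x-\theta)(M(x)-\alpha) > 0 \quad \text{for all } \varepsilon > 0 ,$$

$$M(x) = \alpha + \alpha_1 (x-\theta) + o(|x-\theta|) \quad \text{as } x \to \theta ,$$

there exists a d > 0, such that for all x,

$$E[Y^2(x)] \le d(1+x^2) ,$$

$$\lim_{x \to \theta} E\{(Y(x)-M(x))^2\} = \sigma^2 > 0 ,$$

$$\lim_{N \to \infty} \lim_{\varepsilon \to 0} \sup_{|x-\theta| < \varepsilon} E[(Y(x)-M(x))^2 \Phi_N(x)] = 0 ,$$

where $\Phi_N(x) = 1$ for $|Y(x) - M(x)| > N$ and $\Phi_N(x) = 0$ for
$|Y(x) - M(x)| \le N$,
and

$a_n = A n^{-1}$ is such that $A\alpha_1 > 1/2$.

Then the distribution of $n^{1/2}(x_n - \theta)$ tends to normal with
mean zero and variance = $A^2 \sigma^2 (2A\alpha_1 - 1)^{-1}$.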
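
A small Monte Carlo experiment gives a feel for the normal limit. The sketch assumes a linear regression function $M(x) = \alpha_1 x$ with $\theta = 0$, $\alpha = 0$, Gaussian errors and $a_n = A/n$, for which the limiting variance of $n^{1/2}(x_n - \theta)$ would be $A^2\sigma^2/(2A\alpha_1 - 1)$; the function name `scaled_errors`, the seed and the run sizes are illustrative choices only.

```python
import random

def scaled_errors(n_steps, n_runs, A, alpha_1, sigma, seed=4):
    """Return n_runs independent values of sqrt(n_steps) * x after n_steps steps."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_runs):
        x = 1.0                                        # arbitrary starting value
        for n in range(1, n_steps + 1):
            y = alpha_1 * x + rng.gauss(0.0, sigma)    # observation with mean M(x)
            x -= (A / n) * y                           # RM step with alpha = 0
        out.append(n_steps ** 0.5 * x)
    return out

values = scaled_errors(n_steps=1000, n_runs=1000, A=1.0, alpha_1=2.0, sigma=1.0)
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
print(mean, variance)  # compare with 0 and A^2 sigma^2 / (2 A alpha_1 - 1) = 1/3
```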


J. SELECTION OF STEP SIZE, $a_n$
As we have noted thus far the sequence $\{a_n\}$ must
essentially have the same asymptotic behavior as the
harmonic series, 1/n, which satisfies the conditions

$$\sum_{n=1}^{\infty} a_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} a_n^2 < \infty .$$

One can intuitively see that the first condition is necessary
to guarantee that the sequence, $\{x_n\}$, does not get trapped
in a finite interval while the second condition is necessary
for the convergence of the expected squared error term.
However it is reasonable to ask if there is a sequence,
$\{a_n\}$, which minimizes $E[(x_N - \theta)^2]$ after some fixed number
of observations, say N. Dvoretzky [Ref. 36] solved this
problem for the Robbins-Monro process.


THEOREM [Ref. 36] 
Assume that a random variable, Y(x), satisfies the 
conditions 


(1) ELY*(x)] < 0% < « 
and assume that M(x) is such that


11) o<as BIZ <Bc<o 


ome if it is known that 


Gai) x 


eso er 


Meet 1f the sequence, a. = A, is used in the Robbins- 


2 Ge NA 
Hema process, then the resultant 


2 ae" 
EL(x. - @)°]) < ——— 
Bee ES 
is obtained. The choice of $a_n$ here is optimal in the
minimax sense in that for any other choice of $\{a_n\}$ there
exist Y(x) and $x_1$ that satisfy conditions (i) and (iii) for
which the above bound on expected squared error does not
hold.

Now it is obvious that this result is of limited
use to the experimenter who has little a priori information
with which to choose $a_n$. Therefore, for practical choice of
the sequence, $a_n$, the reader is directed to Section V.A.
where this problem is discussed.


K. ACCELERATING CONVERGENCE 


When the initial guess, $x_1$, is far from the desired
value of $\theta$, the Robbins-Monro procedure approaches $\theta$ very
slowly because we are taking smaller and smaller steps.
Kesten [Ref. 69] proposed the method of accelerating the
convergence of a stochastic approximation algorithm based
on not decreasing the step size, $a_n$, if the difference
$(x_n - x_{n-1})$ has the same sign as $(x_{n-1} - x_{n-2})$, and decreasing
the step size if the signs differed, indicating that we
may be in the region of $\theta$. (Higher order schemes are also
proposed.)

It was shown that there exists a $\theta'$, not necessarily
identical with $\theta$, about which fluctuations in sign occur
most frequently in a finite number of trials. The value
$x = \theta'$ is defined by the intersection of the line
$Y(x) = \alpha$ and the locus of medians of the densities
p(Y|x) for any x. If the density p(Y|x) is
symmetric, then $\theta' = \theta$. Even if the fluctuations occur
about a $\theta'$, different from $\theta$, $x_n$ still converges in
probability to $\theta$ as Kesten proved.

Authors such as Odell [Ref. 87], Sinha and Griscik
[Ref. 105], Sielken [Ref. 104], and Newbold [Ref. 86] have
presented accelerated stochastic approximation methods of
their own and have compared them with the original R-M
method and Kesten's method.
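
The idea of shrinking the step only after a change of sign can be sketched as follows. This is a simplified variant written for illustration, not a transcription of Kesten's procedure: the step index starts at 1 and is advanced once for every sign change between consecutive increments. The shallow response $M(x) = 0.1x$, the distant start $x_1 = 100$ and the names `approximate_root` and `noisy` are hypothetical.

```python
import random

def approximate_root(observe, alpha, x1, n_steps, c=1.0, kesten=False):
    """Robbins-Monro iteration, optionally with a Kesten-type step rule."""
    x = x1
    k = 1                   # index of the step size a_k = c / k currently in use
    prev_diff = 0.0         # previous increment x_n - x_{n-1}
    for n in range(1, n_steps + 1):
        if not kesten:
            k = n           # plain Robbins-Monro: shrink the step every iteration
        diff = (c / k) * (alpha - observe(x))
        if kesten and diff * prev_diff < 0.0:
            k += 1          # increments changed sign: shrink the later steps
        x += diff
        prev_diff = diff
    return x

shallow = lambda x: 0.1 * x              # noise-free response with root theta = 0
x_plain = approximate_root(shallow, 0.0, 100.0, 200)
x_kesten = approximate_root(shallow, 0.0, 100.0, 200, kesten=True)

def noisy(seed):
    rng = random.Random(seed)
    return lambda x: 0.1 * x + rng.gauss(0.0, 1.0)

x_plain_noisy = approximate_root(noisy(5), 0.0, 100.0, 2000)
x_kesten_noisy = approximate_root(noisy(5), 0.0, 100.0, 2000, kesten=True)
print(x_plain, x_kesten, x_plain_noisy, x_kesten_noisy)
```

In the noise-free run the increments never change sign, so the accelerated rule keeps the full step and closes in geometrically, while the plain 1/n steps leave the estimate far from the root.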

Another method of accelerating convergence was proposed 
by Fabian [Ref. 40]. This method is an analog of the method 
of steepest ascent (descent). Fabian proposed that the 
step an Pem@ecwormineca if the following manner; for given Xn 


and y one Hakeomonserics Of observations, E (where the 







observations are assumed to be independent of x. and Yad 
mn the quantity M(x, + aay) r l2... until sign 


V = ... = sign Mae = sign Ue = -sign Yan . Then choose 


l a 
ja. (Note here a = 0 = M(8).) Fabian proved that under 


w 
It 


very general conditions on k iteration methods converge 
with probability 1.

Authors who are interested in the practical or experimental
aspects of stochastic approximation have suggested
that the approximation method be carried out in two stages.
The first stage would take large steps to estimate the
region of interest while the second stage would take
progressively smaller steps and constitute the fine tuning
stage. (See Davis [Ref. 22], Wetherill [Ref. 130], and
Goodman, Lewis and Robbins [Ref. 58].)


L. CONFIDENCE INTERVALS AND STOPPING TIMES
After k iterations it may be desired to obtain an estimate
of $\theta$ and a d such that

$$P(|x_{k+1} - \theta| \le d) \ge 1 - \gamma .$$

Farrell [Ref. 50] did some of the first work on confidence
intervals of bounded length but required a priori knowledge
of a bounded interval containing $\theta$.

The subject of stopping times of a non-parametric nature 
is an almost untouched area. Farrell stated that Mrs. Nancy 
er, Cornell University, had been studying closed stopping 


rules and bounded length confidence interval procedures for
the median of a distribution function. However very little
has appeared in stochastic approximation literature concerning
stopping rules.

The most general discussion of stopping times based on 
the asymptotically normal result was recently presented 
by Sielken [Ref. 104] and is stated below. 


Using the definition,
Z(x) = Y(x) - M(x)
and under the following conditions,


(1) WVelomompoctoive constant less than 1/2. 

(2) The sequence em opo itive constants 
is such that ey Pa Cum Tl OO, SOT 
Somer CO < Cc <a 

(3) The sequence en on 
Kocmesrniezazeenstant such that 2Aa. > 1. 


i 
(4) M is a Borel-measurable function. 


Cop nome > 0, inf Mono 
Ee<x-de — 
and sup 5 M(x) - a < 0. 
E<B-x<E 
(6) For some constants K, and K, 


IC E E Bee for all x. 


1 
(7) sup EL|Z(x)|°] = w. 


x 
Cee reece Biz(e)- 1 = 0é > 0. 
X79 







(9) Lim Lim, Sup N IZ(x) | aP=0. 
Ree e+0 |x-8|<e |z(x)|>R 


(10) For some positive constants g and $\alpha_1$,
if $|x-\theta| < g$,
then $M(x) = \alpha + \alpha_1 (x-\theta) + \delta(x)$,
where $\delta(x) = o(|x-\theta|)$ as $|x-\theta| \to 0$.
(11) The distribution function of Y(x),
denoted F(Y|x), is such that for every
y, $F(y|\cdot)$ is Borel-measurable.


and 
(12) There exists e > 0 such that for every 


DOS ute integer fe 
Sup A ACI IR 
|x-0|<e 
Hen assuming that a 100(1 - 2y)% confidence interval on 
meer length 2d is desired, the proposed stopping time for 
the R-M process is denoted N where N is the 
E As fal d,y,1 


Emarlest positive integer, n, such that 


ie Oe 2 2 
n>K A Sa AT OT 


The principal results of Sielken are:


THEOREM [Ref. 104] 


If conditions (1) - (12) above are satisfied then
o 2 , 
A Kena Alone, = 1)a°] = 1 , 
a>0 sY’ L 


with probability 1,
and

$$\lim_{d \to 0} P(|x_N - \theta| \le d) = 1 - 2\gamma ,$$

where N denotes the stopping time.


Sielken has stated that the limit in the theorem can be
interpreted as either:
(a) the coverage probability of the sequentially determined
bounded length confidence interval converges
to the prescribed level, $1 - 2\gamma$, as
the length of the interval, 2d, converges to zero;
or
(b) the probability that the error in the final
estimate of $\theta$ is less than or equal to d
converges to the prescribed probability,
$1 - 2\gamma$, as d converges to zero.


M. DYNAMIC STOCHASTIC APPROXIMATION 

Fabian [Ref. 39] and Dupac [Ref. 34] have considered the
case where the desired level, $\theta$, changes during the iteration
process. The following discussion is by Fu [Ref. 53] based
on Dupac's presentation.


Let $M_n(x) = M(x - \theta_n + \theta_1)$ such that $\theta_n$ is the unique
root of $M_n(x) = 0$. Let $\{a_n\}$ be a sequence of positive
numbers, and let $x_1$ be an arbitrary random variable.
Define:

$$x_{n+1} = x_n' - a_n Y(x_n') ,$$

where

$$x_n' = (1 + n^{-1}) x_n ,$$

$$E[Y(x_n') | x_1, \ldots, x_n] = M_{n+1}(x_n')$$

and

$$\mathrm{var}[Y(x_n') | x_1, \ldots, x_n] \le \sigma^2 < +\infty .$$

The meaning of the above algorithm for computing $x_{n+1}$ with
the modified $x_n$, i.e. $x_n'$, is that when we get an estimate,
$x_n$, of $\theta_n$, we make a correction for trend to obtain $x_n'$ before
computing $x_{n+1}$. It will be seen by the following theorem that
the use of this modified algorithm is justified when $\theta_n$ is
a linear (or nearly linear) function of n.
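
The effect of the trend correction can be illustrated with a small simulation. The sketch rests on several assumptions that are not taken from the original text: the recursion is read as $x_{n+1} = x_n' - a_n Y(x_n')$ with $x_n' = (1 + 1/n) x_n$, the observation at step n has mean $x_n' - \theta_{n+1}$ (a unit-slope regression function), the root drifts linearly as $\theta_n = 0.5 n$, and $a_n = n^{-0.6}$. The function name `tracking_error` and all parameter values are hypothetical.

```python
import random

def tracking_error(n_steps, slope, a, alpha_exp, sigma, trend_correction, seed):
    """Return x_{N+1} - theta_{N+1} after N = n_steps iterations."""
    rng = random.Random(seed)
    theta = lambda n: slope * n                 # assumed linear drift of the root
    x = theta(1) + 1.0                          # arbitrary starting guess x_1
    for n in range(1, n_steps + 1):
        # Correction for trend: scale the estimate before the next observation.
        x_mod = (1.0 + 1.0 / n) * x if trend_correction else x
        y = (x_mod - theta(n + 1)) + rng.gauss(0.0, sigma)   # mean M_{n+1}(x_mod)
        x = x_mod - (a / n ** alpha_exp) * y
    return x - theta(n_steps + 1)

err_corrected = tracking_error(5000, 0.5, 1.0, 0.6, 1.0, True, seed=6)
err_plain = tracking_error(5000, 0.5, 1.0, 0.6, 1.0, False, seed=6)
print(err_corrected, err_plain)
```

Without the correction the estimate lags further and further behind the drifting root as the steps shrink, whereas the corrected iteration stays close to it in this linear-drift setting.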


THEOREM [Ref. 34] 


Assume that the following conditions are satisfied: 


G) Mis O Torax < O, and M(x) 2 0 for x > 1° 
(ii) There exist Ky» K, such that 
MESS 0 


Ky |x - 0 Toms ner. 


1! 1! 
(ash) a, = a/n”, A A A A 
(iv) 6, varies in such a way that 

> 


8 = en 


a Y 
a] 6 = O(n §) forw>a 


and eS E(x, °) ER 


Then (x, ~ 6) approaches zero in the mean and 
O(n””) for 4 <a <2/3


2 
EL ( - 6 )"Jj = 
“n a - 20) 


O(n ) for 2/3<a<1. 

The mean square convergence, as well as convergence with
probability 1, can be deduced from Dvoretzky's theorem, even
under slightly more general conditions on $\theta_n$. A similar
modification to the Kiefer-Wolfowitz procedure is indicated
to solve for a moving maximum of a regression function.

An interesting algorithm is presented in Fu's book
[Ref. 53] for learning of slowly time varying parameters
using dynamic stochastic approximation. Here Kesten's
accelerated scheme [Ref. 69] is coupled with Dupac's dynamic
process to improve the rate of convergence.


N. CONTINUOUS STOCHASTIC APPROXIMATION 

In order to obtain a continuous version of the stochastic
approximation method, one can replace the difference recursive
relation in the discrete case with a stochastic differential
equation. Again letting the desired level of response, a,
be equal to zero, one obtains the general expression

     dX(t)/dt = -a(t) Y(t, X(t)),

where a(t) satisfies the conditions

     ∫_0^∞ a(t) dt = ∞   and   ∫_0^∞ a^2(t) dt < ∞.

The above equation determines a continuous process for
stochastic approximation of the solution to the equation
M(x) = 0. Driml and Nedoma [Ref. 31] proved that the process
converges when Y(t,x) is monotonic in x and when Y(t,x) is
of the form Y(t,x) = M(x) + h(t) where h(t) is an ergodic
process with zero mean. In both cases the function X(t)
approaches the desired value, θ, with probability 1 as
t → ∞. In the proof by Driml and Nedoma

af O t <l 


ai a= 
Ne or A 
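
A continuous procedure of this kind can be explored numerically by discretizing time. The following is a minimal sketch, assuming a forward-Euler step, the gain a(t) = 1/(1 + t) (which has a divergent integral and a finite integral of its square), a linear M(x) = x - θ, and a stationary zero-mean AR(1) sequence standing in for the ergodic disturbance h(t). None of these choices come from Driml and Nedoma; they are illustrative only.

```python
import math
import random


def continuous_rm_euler(theta=2.0, x0=7.0, t_end=500.0, dt=0.01,
                        rho=0.9, sigma=1.0, seed=0):
    """Euler discretization of dX/dt = -a(t) * Y(t, X) (illustrative only)."""
    rng = random.Random(seed)
    x, h, t = x0, 0.0, 0.0
    innovation_sd = sigma * math.sqrt(1.0 - rho * rho)
    for _ in range(int(t_end / dt)):
        # Zero-mean, correlated disturbance h(t): a stationary AR(1) sequence.
        h = rho * h + innovation_sd * rng.gauss(0.0, 1.0)
        y = (x - theta) + h            # Y(t, x) = M(x) + h(t), M(x) = x - theta
        a_t = 1.0 / (1.0 + t)          # assumed gain function a(t)
        x -= dt * a_t * y              # Euler step of the differential equation
        t += dt
    return x


print("X(t_end) =", continuous_rm_euler())
```

Note that the disturbance here is deliberately correlated in time; only its zero mean and ergodicity are relied upon.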


O. EXTENSIONS OF CONTINUOUS STOCHASTIC APPROXIMATION

As was experienced in the discrete case the one dimensional
continuous case can be extended to the multidimensional
case. However many theorems which are valid for the one
dimensional case are not valid for the multidimensional case
which depends heavily on stationary point theorems. (I.e.
theorems concerning a point x_0 of some space X for which
F(x_0) = x_0, where F maps X into X.) For a discussion of these
theorems see Driml and Hans [Ref. 30] and Hans and Spacek
[Ref. 61].


Another representation using continuous stochastic
approximation is by Kitagawa [Ref. 71] who formulated a
Robbins-Monro process where the Brownian motion process is
used to represent the random disturbances inherent in the
observations.
IV. FINDING THE MAXIMUM OF AN UNKNOWN REGRESSION
FUNCTION: THE KIEFER-WOLFOWITZ METHOD


A problem of practical importance with a regression
function, Y(x), is to estimate the value of x, say θ, at
which the expectation of Y(x), denoted M(x), is a maximum.
To intuitively introduce the method consider the following
argument from Wetherill [Ref. 131].


[Figure 1: three sketches of y(x) against x, panels (a), (b), (c)]


Suppose two observations, y(x_1) and y(x_2), are taken at
values x_1 and x_2 where x_1 < x_2. Then

(a) If y(x_1) < y(x_2) one expects the maximum
    level, θ, to be at a value > x_1.

(b) If y(x_1) > y(x_2) one expects the maximum
    level, θ, to be at a value < x_2.

(c) If y(x_1) is about equal to y(x_2) more
    observations are necessary to determine the
    region of interest.
Thus it would be reasonable to take further observations
in the direction indicated by the slope of the two y values
and the distance moved along the x-axis, before taking
further observations, should be proportional to the difference
between y(x_1) and y(x_2). Using this basic idea and the
fundamental results of Robbins and Monro, Kiefer and Wolfowitz
[Ref. 70] defined the following procedure for stochastic
approximation of the maximum of a regression function.


THEOREM [Ref. 131]
Let M(x) be a regression function and F(Y|x) a family
of distribution functions and assume that the following
conditions are satisfied:

     ∫_{-∞}^{∞} (Y - M(x))^2 dF(Y|x) ≤ σ^2 < +∞.

Also assume that M(x) is strictly increasing for x < θ, and
that M(x) is strictly decreasing for x > θ.
Let {a_n} and {c_n} be infinite sequences of positive
real numbers such that

     c_n → 0,   Σ a_n = ∞,   Σ a_n c_n < ∞,   Σ a_n^2 c_n^{-2} < ∞

(for example: a_n = n^{-1} and c_n = n^{-1/3}). Then the
scheme defined by

     x_{n+1} = x_n + (a_n / c_n) [Y(x_n + c_n) - Y(x_n - c_n)]

converges in probability to the maximum, θ, of the regression
function Y(x) if three regularity conditions are satisfied.
They are listed here with their intuitive meanings.


Condition 1.* There exist positive β and B such that

     |x_1 - θ| + |x_2 - θ| < β implies |M(x_1) - M(x_2)| < B |x_1 - x_2|

for all x_1, x_2. If the function, M(x), has
a derivative, it must be zero when x = θ; as a result
the derivative must be bounded in the neighborhood of
θ.

Condition 2. There exist positive ρ and R such that
|x_1 - x_2| < ρ implies |M(x_1) - M(x_2)| < R. In other
words if M(x) increases too abruptly in certain regions,
there exists a positive probability that it may reach 
NN eas a resulti, the Lipschitz condition must 

be satisfied.

Condition 3. For every δ > 0, there exists a positive
π(δ) such that |x - θ| > δ implies

     inf  |M(x + ε) - M(x - ε)| / ε > π(δ).
   δ > ε > 0

If M(x) is a very flat function the rate of motion toward
θ is small. As a result, the absolute value of the
derivative must be bounded below.


* As Blum later proved [Ref. 8], the above theorem 
holds even when Condition 1 is not satisfied. 
While these regularity conditions seem restrictive it
is only necessary that they hold in an interval [c_1, c_2]
where it is known a priori that c_1 < θ < c_2. Suppose, however
that some proposed level, x_n ± c_n, lies outside the interval,
[c_1, c_2] and one cannot take an observation at that level.
If one then moves x_n so that the offending x_n ± c_n is at
the boundary (c_1 or c_2) we may proceed as directed and the
conclusions remain valid.
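
The procedure and the boundary rule just described can be sketched in a few lines. This is an illustration under assumptions rather than the original authors' code: the sequences a_n = 1/n and c_n = n^(-1/3) are one admissible choice, the quadratic response with maximum at x = 2 and the Gaussian noise are invented for the demonstration, and the names `kiefer_wolfowitz`, `observe`, `lo` and `hi` are hypothetical.

```python
import random


def kiefer_wolfowitz(observe, x1, lo, hi, n_steps=5000):
    """One-dimensional Kiefer-Wolfowitz search for a maximum on [lo, hi]."""
    x = x1
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n
        c_n = n ** (-1.0 / 3.0)
        # Move x_n so that both levels x_n - c_n and x_n + c_n lie in [lo, hi].
        x = min(max(x, lo + c_n), hi - c_n)
        # Two noisy observations, one on each side of the current level.
        diff = observe(x + c_n) - observe(x - c_n)
        # Step in the direction of the estimated slope.
        x = x + (a_n / c_n) * diff
    return min(max(x, lo), hi)


rng = random.Random(0)


def noisy_response(x):
    # Hypothetical regression function M(x) = -(x - 2)^2 observed with noise.
    return -(x - 2.0) ** 2 + rng.gauss(0.0, 0.5)


print("estimated maximizer:", kiefer_wolfowitz(noisy_response, 0.0, -5.0, 10.0))
```

The interval must be wider than 2 c_1 for the clamping step to be well defined; here it is 15 units wide against c_1 = 1.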


A. CONSTANT COEFFICIENTS

Burkholder [Ref. 12] proved that under certain conditions,
the Kiefer-Wolfowitz procedure can still be used if c_n is
held constant for all n at a particular value, c_0. x_n is
then asymptotically normally distributed with variance
proportional to n^{-1}. This result is difficult to use in
practice since there will rarely be enough information about
the response curve to choose c_0 as required by Burkholder.


B. CONVERGENCE WITH PROBABILITY 1

The Kiefer-Wolfowitz process is a special case of the
Dvoretzky process. (I.e. the process can be written as the
sum of a deterministic term and an error term.) This can
be seen by writing

     x_{n+1} = x_n + (a_n / c_n) [M(x_n + c_n) - M(x_n - c_n)] + z_n,

where the error term is

     z_n = (a_n / c_n) [Y(x_n + c_n) - M(x_n + c_n)
                        - Y(x_n - c_n) + M(x_n - c_n)].
It follows from a theorem by Dvoretzky that the Kiefer- 
Wolfowitz procedure converges with probability 1 and in 

mean square under conditions weaker than those imposed by 
Kiefer and Wolfowitz. Burkholder [Ref. 12] also proved 
convergence with probability 1 using a somewhat different 
approach. Later Venter [Ref. 122] showed that the K-W
process converges almost surely to the maximum if this is the
only stationary point of the surface and some other
conditions are satisfied. This result is stronger, in a sense,
than those existing previously.


C. MULTIDIMENSIONAL KIEFER-WOLFOWITZ 

Let {Y(x_1, ..., x_N)} be a family of random variables; let
F(y | x_1, ..., x_N) be the corresponding distribution function;
and let M(x_1, ..., x_N) be the corresponding regression function.
We then desire to find a vector x = θ, for which the
regression function is a maximum. Assume that M(x) has a unique
maximum at the point x = θ.

Blum [Ref. 7] constructed a multidimensional K-W process
in the following manner. Let x ∈ R_N and let (e_1, ..., e_N) be
an orthonormal basis in R_N. Then for some real c > 0, we
make N + 1 observations of the random variable Y(·),

     Y(x), Y(x + c e_1), Y(x + c e_2), ..., Y(x + c e_N)
and consider the vector

     Y_c(x) = [{Y(x + c e_1) - Y(x)}, ..., {Y(x + c e_N) - Y(x)}].

Then beginning with some arbitrary vector, x_1, construct the
sequence

     x_{n+1} = x_n + (a_n / c_n) Y_n(x_n),

where

     Y_n(x_n) denotes Y_{c_n}(x_n).

Denote the vector of first derivatives of M(x) by D(x), and
the matrix of second derivatives by A(x). Then the following
theorem by Blum is presented:


THEOREM [Ref. 7]
Let {a_n} and {c_n} be sequences of positive real
numbers that satisfy:

     lim c_n = 0,   Σ a_n = ∞,   Σ a_n c_n < ∞,   and   Σ a_n^2 c_n^{-2} < ∞.

Moreover assume that Y(x) and M(x) are such that

     Var[Y(x)] ≤ σ^2 < ∞,

M(x) is continuous together with its first and second
derivatives, and for any ε > 0, there exists a ρ(ε) > 0
such that

     ||x|| > ε implies that
     M(x) < -ρ(ε), and
     ||D(x)|| > ρ(ε),

where the partial derivatives ∂^2 M(x)/∂x_i ∂x_j are bounded
for i, j = 1, ..., N.

Then the sequence {x_n} so constructed converges
to θ with probability 1. Note that each step in Blum's
algorithm requires N + 1 observations. Gray [Ref. 59]


proved that the multidimensional K-W process defined by

     x_{n+1} = x_n + (a_n / c_n) [{Y(x_n + c_n e_1), ..., Y(x_n + c_n e_N)}
                                  - {Y(x_n - c_n e_1), ..., Y(x_n - c_n e_N)}],

which requires 2N observations in each step, also converges.
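
The two multidimensional designs can be compared side by side in a short sketch: the one-sided scheme uses N + 1 observations per step (one at x and one along each basis direction), while the symmetric scheme uses 2N. The code below is illustrative only; the quadratic response, the noise level, the sequences a_n = 1/n and c_n = n^(-1/3), and the names `multidim_kw` and `observe` are assumptions, not taken from Blum or Gray.

```python
import random


def multidim_kw(observe, x1, n_steps=5000, symmetric=False):
    """Multidimensional K-W step along the coordinate basis e_1, ..., e_N."""
    x = list(x1)
    dim = len(x)
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n
        c_n = n ** (-1.0 / 3.0)
        # One-sided design shares a single central observation Y(x).
        y_center = None if symmetric else observe(x)
        diffs = []
        for i in range(dim):
            plus = x[:]
            plus[i] += c_n
            if symmetric:
                minus = x[:]
                minus[i] -= c_n
                diffs.append(observe(plus) - observe(minus))   # 2N observations
            else:
                diffs.append(observe(plus) - y_center)         # N + 1 observations
        x = [xi + (a_n / c_n) * d for xi, d in zip(x, diffs)]
    return x


rng = random.Random(0)
THETA = [1.0, 2.0]   # hypothetical location of the maximum


def noisy_surface(x):
    # Hypothetical regression surface -||x - theta||^2 observed with noise.
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, THETA)) + rng.gauss(0.0, 0.3)


print(multidim_kw(noisy_surface, [0.0, 0.0], symmetric=False))
print(multidim_kw(noisy_surface, [0.0, 0.0], symmetric=True))
```

In this quadratic example the one-sided differences carry a bias of order c_n that vanishes as c_n tends to zero, while the symmetric differences are unbiased for the slope.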
D. ASYMPTOTIC PROPERTIES OF K-W PROCESS

The first results concerning asymptotic properties of
the Kiefer-Wolfowitz process were obtained by Derman [Ref. 24]
and Dupac [Ref. 32] based on the lemmas of Chung [Ref. 14].
Sacks [Ref. 95] has discussed conditions for asymptotic
normality of x_n. If c_n is required to tend to zero, then the
asymptotic variance of x_n can never be as small, in
order of magnitude, as Burkholder's result of being
proportional to n^{-1} with c_n = c_0, a constant. The most general
results without a priori assumptions about the length of
the interval containing the point x = θ have been obtained
by Sacks.


THEOREM [Ref. 95]
Let M(x) be a measurable function with a unique maximum
at x = θ, and assume that this function satisfies the
conditions:


(x-8)(M(x-e) - M(xte)) 


(i) PaT 


. > 0, 
zer 


O<e<e 
O 


MAS A IS AE << 8; 


O I 2 
(a) for all x, M(x) = Ay 7 a (x - e)? IRS TO, 
where a, > 0 and 6(x,0) = o(|x-e|°) 


ib 


as |x-0| > 0; 


(iii) for some Cy > 0, there exists positive 


constants K, and K3, such that for all x 


and all c for which 0 < Cc < Cy 


JL 





K. (x-8) °<(x-8) [M(x-e)-Mixte)Je”"<K,(x-8)°; 

(iv) For every e > O there exists a Sn 0 such 
Deal er satisiying 0 < c < cr and 
all x satisfying |x-8@| < c¢ 


|S(x-c,8) _ E < ee | 


Further assume 


nee o-/2 
x>6 


and 


Dem Lim Sup f (¥(x)-—M(x) )“aP = 9 
Roo ¢€>0 [|x-8ļ|<e  [Y(x)-M(x)|>R 


MEN if a, = A where A > 1/2K,, the random variable 


ie 
1 
are (x, - 8) is asymptotically normally distributed with 


Mean = 0 
yal 


Variance = o SAS (Ban - 1 
Sacks, in the same paper, also gave the similar asymptotic
normal distribution for the multidimensional K-W process.


E. MAXIMUM SAMPLE EXCURSIONS IN KIEFER-WOLFOWITZ PROCESS 
When we seek a maximum or minimum using the Kiefer-
Wolfowitz process the possibility arises that we may be
working with a function with more than one local maximum or
that we do not want to reduce the performance, M(x), below
some minimum level. The value of x corresponding to this
level may not be known. In both of these cases we may wish
to limit the excursions to some given multiple or function of
|x_1 - θ|, with a high probability, while still being certain
that x_n → θ with probability 1. To accommodate this
situation Kushner [Ref. 79] presented estimates of the
following form:

For any m < ∞ and even integer r,


PI max |x, -8] ae] < [E(x_-0)* + Sn Wiese 
MN? N Ia 
where S_n depends on the sequences {a_n} and {c_n} and can be
made arbitrarily small for each fixed N and r, while
convergence with probability 1 is still ensured.


F. ACCELERATED CONVERGENCE FOR THE K-W PROCESS

As in the case of the Robbins-Monro process, the rate
of convergence of the K-W process can be increased by using
Kesten's algorithm [Ref. 69] (See Sec. III.J). Another
method for accelerating convergence was proposed by Fabian
[Ref. 40] who later showed [Ref. 45] that the multidimensional
procedure for functions, f, sufficiently smooth at θ,
the point of minimum (or maximum) can be modified in such
a way as to be almost as speedy as the R-M method. This
modification consists of making more observations at every
step and of utilizing these to eliminate the effect of all
derivatives ∂^j f/∂x_i^j, j = 3, 5, ..., s - 1. Let δ_n be the
distance from the approximated θ after n observations. Under
similar conditions on f as those used by Dupac [Ref. 32] the
result

     E δ_n^2 = O(n^{-s/(s+1)})

can be obtained. Under weaker conditions it was proved that

     δ_n^2 n^{s/(s+1) - ε} → 0

with probability 1 for every ε > 0.
In a follow-up paper Fabian [Ref. 46] noted that there
are many designs, d, which achieve the speed of Eδ_n^2 as
described above. He derived the dependence relation of d on


Lim P ES ° 


so that one may choose the design which minimizes Eδ_n^2.

In yet a third paper in this series by Fabian [Ref. 48],
the results of a design which minimizes Eδ_n^2 are utilized
and Fabian achieved the result


alu ole + 
a n 


= 
log u 
where t_n equals the number of observations necessary to


construct See Ne 
ur, “nn 


G. THE CONTINUOUS KIEFER-WOLFOWITZ PROCESS 

As with the Robbins-Monro method we have a continuous 
analog of the Kiefer-Wolfowitz method. Let us consider a 
method, as discussed in Loginov's survey [Ref. 81], for an 


ergodic random process Y_t(x). Let x denote an N-dimensional
vector with coordinates x_1, ..., x_N in N-dimensional Euclidean
space with orthonormal basis e_1, ..., e_N. Then the regression
function is M(x) = E[Y_t(x)]. Moreover assume that

ye 07] Re] - Yealxzett)e, ], 
where c(t) is some positive function. Then the continuous
K-W method of determining a minimum point for a regression
function is described by the equation


dx 


le -] S 
aoe. Se 
are alt)I, ¿o (Edy, ger seit] 
Zeh initial conditions Xi q” x, (0), Ore ee ney. eagle, 
3 


where 
E e RE) 
i,t 4 Sq c u N 


2... a : E E i 
Here G, is a monotonic function with derivative bounded on 


[b, - &,b,] and 


a 


and 6, is a monotonic function with derivative bounded on 


La, ‚a, + 8] and 


S 0 for X 
G, (x) q 


for X= Aa 


{Vv 
a) 
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and FR (y)=2- y/ect) and MEOD eee for e > 0. 
The essential difference between the original discrete
Kiefer-Wolfowitz method and this continuous version is the
fact that here the observations need not be independent as
they were in the discrete case. The term I_i serves to
limit the variable x_i to the interval [a_i, b_i].
Sakrison proved the following convergence theorem for
the continuous K-W process.


THEOREM [Ref. 92] 


Represent Y+ Asche Torm 


N 
(x) = E mG) 
y, (x) Ae 8) jt 


where MA , are ergodic random processes that are bounded 

2 
feel probability one, while ED are r unctions whose second 
Parcial derivatives with respect to X, are bounded. 


Now let D denote any of the random processes V 


ttp e,ttp 
or eto n,t+p (eon) =e. aN). Moreover let F, be ar / 
bounded functional defined on the processes aes Gt = i) 
2 
and Bep(P) = Mi(P, - MCF, )) (Day, - M(D, 59} be such that 


IBanle)l < ond (K,/0°) 


where K, < +e. Assume that the regression function, M(x) 


satisfies the conditions 
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2 
(grad M(z)|,_,>%-0) > Ky Feen , 


|lerad Mx) [1° < K I]x-el1° 








2 5 
a 5 
3 
OX. 
i 
meee 2,...,N. Then if the relations 


O at = S 
2 
Deal. <. 
J atmwaelyerjdt <= 
ol 
ae (todat < 
hold for the functions a(t) and c(t), the solution of the
stochastic differential equation converges to θ in mean
square, i.e.
; 2 
Lim zu ud. 
t> 


An example of functions satisfying the above conditions
is


al 





- and c(t) = E 5 


(t+1)% | (t+1)! 








where 


Se 
A 
Q 

IA 

2 


and ve L TERG), 


Example (by Sakrison [Ref. 92]). 


ace = | and Y 2% 
then 
2 *5 
Et| |x, AO. 


It is not difficult to see that in the continuous case
the requirements of the theorems are considerably more
stringent than those in the discrete case. Here constraints
are imposed on the process itself, not just on the
regression function. This is the fundamental difference between
discrete and continuous stochastic approximation methods.
V. PRACTICAL ASPECTS


A. CHOICE OF a_n

In Section III a theorem of Dvoretzky [Ref. 36] was
presented giving a formulation for the sequence, a_n, which
is optimal in the minimax sense. However this formulation
contains parameters which will in general be unknown to the
experimenter. The need then arises for a method of choosing
this sequence.

Hodges and Lehmann [Ref. 64] recommended using coefficients
of the form a_n = c/n where c is chosen to minimize the
asymptotic variance, σ^2 c^2/(2 α_1 c - 1). This leads to
choosing c = 1/α_1 where α_1 is the slope of the response
function, M(x), at the desired level of x = θ. (I.e. choose
c = 1/α_1, where α_1 = M'(θ).) This does not reduce the
experimenter's dilemma since it requires a priori estimation
of another unknown parameter. It does however provide a
basis for sensitivity analysis on expected squared error
based on changes in the multiplier, c, in terms of α_1.
Computer simulations were performed by Hodges and Lehmann
[Ref. 64] and by Wetherill [Ref. 130] with very similar
results.

In general choosing c < 1/(2 α_1) should be avoided since
the asymptotic behavior is unknown and simulation experiments
indicate that large biases exist when c is chosen to be too
small. Similarly when c is chosen too large the asymptotic
variance increases, however it increases slowly for c above
1/α_1. Thus when the value of α_1 is unknown it would be more
desirable to overestimate the value of c than to
underestimate c.

In the special case where M(x) is linear it is easily
shown that a_n = c/n, with c = 1/M'(θ), is a desirable
choice. Consider the case M(x) = bx where it is desired to
sequentially arrive at the value of x where M(x) = 0.
Without loss of generality let θ = 0. Thus the value of
x = θ for which M(x) = 0 is θ = 0. Choose c = 1/b noting
that b is the slope of the response function. Then for any
initial value, x_1, the expected value of x_2 can be easily
computed since

     x_2 = x_1 - (1/b) Y(x_1)

implies that

     E(x_2) = x_1 - (1/b) E{Y(x_1)},

where

     E{Y(x_1)} = M(x_1) = b x_1.

Hence

     E(x_2) = x_1 - (1/b)(b x_1) = 0 = θ,

the desired θ.

Thus in the linear case the correct choice of c will
bring the estimate to the neighborhood of θ early in the
process as evidenced by the fact that the first choice
actually produces an unbiased estimate.


B. ESTIMATING THE SLOPE TO IMPROVE ASYMPTOTIC VARIANCE

It was noted by Wetherill [Ref. 130] that in the simple
case where M(x) is a linear function it can be shown
that when we use as the sequence of coefficients a_n = c/n the
choice of c is critical to the efficiency of the process
where efficiency is defined as the reciprocal of the ratio
of the variance for a given c to the variance at c = M'(θ).
See Table 1 (also see Hodges and Lehmann [Ref. 64]).


                         TABLE 1

       Asymptotic Efficiency of the Robbins-Monro
            Process as a Function of c/M'(θ)

c/M'(θ)      0.50   0.75   1.00   1.25   1.50   2.00   2.50
efficiency   0      0.88   1.00   0.96   0.88   0.75   0.64


Table 1 shows that there is a large range of c for which
the process is very efficient, with c = M'(θ) being optimal.
It also would imply that it is better to overestimate the
value of c than to underestimate.
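
The efficiencies in Table 1 can be checked against the asymptotic variance formula quoted in Section V.A. As a hedged reading of the notation (an assumption, since the two sections are not written identically): if a_n = c/n and the variance is proportional to c^2/(2 α_1 c - 1), then the efficiency relative to the optimal multiplier depends only on k, the multiplier expressed as a fraction of its optimal value, and equals (2k - 1)/k^2. The sketch below evaluates that expression at the tabulated ratios; the function name is hypothetical.

```python
def asymptotic_efficiency(k):
    """Relative efficiency (2k - 1) / k**2 for a relative multiplier k.

    k is the multiplier divided by its optimal value. For k <= 1/2 the
    variance formula no longer applies, and the efficiency is taken as zero.
    """
    if k <= 0.5:
        return 0.0
    return (2.0 * k - 1.0) / (k * k)


ratios = [0.50, 0.75, 1.00, 1.25, 1.50, 2.00, 2.50]
for k in ratios:
    print(f"{k:4.2f}  {asymptotic_efficiency(k):.3f}")
```

The expression is symmetric in an informative way: halving the multiplier from its optimum destroys efficiency entirely, whereas doubling it still retains three quarters of it.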
Burkholder [Ref. 12] discussed the possibility of
estimating the slope of M at θ but this procedure was not
investigated further until Venter [Ref. 121] presented an
extension of the Robbins-Monro procedure which estimates
the slope of the regression function at the root. The
method is similar to the Kiefer-Wolfowitz procedure in that
at each step two observations are taken, namely y_n' = Y(x_n + c_n)
and y_n'' = Y(x_n - c_n), where c_n = c n^{-γ}(1 + o(1)), c > 0,
0 < γ < 1/2. Venter required that we know constants a and
b such that 0 < a < M'(θ) < b < ∞. At each step he
estimated the slope by B_n where


     B_n = (1/n) Σ_{j=1}^{n} (y_j' - y_j'') / (2 c_j),

and then kept the estimated slope within the established
bounds by using A_n as the estimator of the slope where

            a     if B_n < a,
     A_n =  B_n   otherwise,
            b     if B_n > b.


Venter then defined the recursive relation

     x_{n+1} = x_n - (n A_n)^{-1} ȳ_n,

where

     ȳ_n = (1/2)(y_n' + y_n'').
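
A slope-estimating procedure of this type can be sketched as follows. The code is an illustration under explicit assumptions and should not be read as Venter's exact scheme: it assumes the slope estimate is the running mean of the symmetric difference quotients (y' - y'')/(2 c_n), that this mean is truncated to the known bounds [a, b], and that the step is x_{n+1} = x_n - ȳ_n/(n A_n) with ȳ_n the average of the two observations. The linear response, the noise level, γ = 1/4 and all names are hypothetical.

```python
import random


def venter_procedure(observe, x1, a, b, n_steps=5000, c=1.0, gamma=0.25):
    """Robbins-Monro search for a root with an estimated, truncated slope."""
    x = x1
    quotient_sum = 0.0
    slope = a
    for n in range(1, n_steps + 1):
        c_n = c * n ** (-gamma)
        y_plus = observe(x + c_n)        # y_n'
        y_minus = observe(x - c_n)       # y_n''
        # Running mean B_n of the difference quotients estimates M'(theta).
        quotient_sum += (y_plus - y_minus) / (2.0 * c_n)
        b_n = quotient_sum / n
        # Truncate to the a priori bounds a <= M'(theta) <= b.
        slope = min(max(b_n, a), b)
        # Step using the mean of the two observations and the slope estimate.
        y_bar = 0.5 * (y_plus + y_minus)
        x = x - y_bar / (n * slope)
    return x, slope


rng = random.Random(0)


def noisy_line(x):
    # Hypothetical response: M(x) = 2 (x - 1), root at 1, slope 2.
    return 2.0 * (x - 1.0) + rng.gauss(0.0, 0.5)


root_hat, slope_hat = venter_procedure(noisy_line, x1=4.0, a=0.5, b=5.0)
print("root:", root_hat, "slope:", slope_hat)
```

The truncation matters mostly in the first few steps, when the running mean is still noisy and an estimate near zero would otherwise produce a very large step.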
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Venter showed that if in the choice of ee that 
fae Y < dl then 
1 
2 


2 2 
n A - 0) a A MC), 


and 
(A — M'(8)) > N(O,0°/2(1 + 2y)). 
n L 
However if y = k then 
n#(x -6) 2 N(-2a,,0°/m" (8) ,0°/2(M'(8))°), 
and 
n“(A -M'(8)) 2 N(0,0°/30°). 


Venter stated that in the case of γ < 1/4 the bias in
the estimate, x_{n+1} of θ, will dominate the error.
Therefore the choice of γ = 1/4 gives a small negative bias but
decreases the variance in the estimate of the slope.

One might ask whether this modified procedure is actually
at a disadvantage since it requires two observations per
step. Venter showed that after n steps (2n observations)
its variance is still achieving the minimum value of the
old Robbins-Monro procedure after 2n steps (2n observations).
Venter also provided an estimate of σ^2 so that confidence
intervals could be constructed for his procedure.

Fabian [Ref. 47] later provided a sophisticated proof of
asymptotic normality of Venter's procedure and of a similar
procedure applied to the Kiefer-Wolfowitz method.


C. SMALL SAMPLE THEORY 

Considering the practical applications of using stochastic
approximation in experiments where infinite quantities
of test items may not be available, it is justifiable to
ask how small sample realizations compare with asymptotic
theory. For instance if an experimenter has less than say
50 animals with which to determine the LD_50 (lethal dose
50%) then one may be concerned with designing a stochastic
approximation method with which to obtain the "best" possible
results and an estimate of the expected error.

1. Choice of x_1:

If one has prior information that θ (for say M(θ) =
0.50) lies in a narrow interval and picks x_1 in that interval
then one can expect the estimates to arrive in the
neighborhood of θ within a few observations. If one has
little prior knowledge of the magnitude of θ, then an initial
bad choice of x_1 can cause a large bias term which will
dominate the observations for many steps.

2. Choice of Multiplier, a_n:

As previously discussed a_n = c/n where c equals the
inverse of the slope of M(·) at θ is optimal in a sense.
Thus one must accurately estimate c for optimal conditions.
If c is too small the step sizes may be too small to get to
θ before the number of samples are depleted. Similarly
if c is too large the estimate may overshoot θ back and forth.
For a detailed analysis see Section V.A.

3. How to Allocate Samples:

If an experimenter has N samples to test, should
he test one at each step and take N steps or test m at each
step and take n = N/m steps? Note that taking more than
one observation at each level, x_n, yields a more accurate
estimate of M(x_n) = E{Y(x_n)}. It was noted by Wetherill
[Ref. 130] and by Cochran and Davis [Ref. 17], and was proven
by Block [Ref. 5], that the variance of the estimate of θ
depends only on the total samples, not on the sampling
scheme, however the corresponding bias term, and hence
mean squared error, is affected by the scheme.

Cochran and Davis presented two graphs which illustrate
their analysis, which is reproduced here. In their
notation σ = the standard deviation of the observation, Y(x),
at x = θ (which in general will be unknown to us). Also
note the following terminology:

MSE : Mean Squared Error;


Co O camacho dcesotocoefficient., es 
m : # of samples taken at each level; 
n : # of levels or steps, 


where nm = N = Total number of samples. 
[Figure 2 and Figure 3: graphs after Cochran and Davis; vertical axis MSE/σ^2]


Figure 2 implies that if x_1 is relatively unknown that it
is more desirable to overestimate c so we are not "trapped"
by a large initial bias and small steps. Figure 3 implies
that if the initial guess, x_1, is more than about 2σ away,
then sampling should be done one at a time, while if the
initial guess is very accurate, then the MSE's are smaller,
although very slightly so, for larger m. Thus as a general
rule unless we know that the initial guess is very accurate
or unless the cost of setting up experiments at different
levels is high, sampling should be conducted as one sample
per level.
Another question which the experimenter may ask is how
much more accurate an estimate becomes if he doubles the
number of samples, say from scheme 1: m = 3, n = 8 to
scheme 2: m = 6, n = 8. Doubling the value of m in this
manner reduces the variance of the estimate by almost exactly
one half, but produces only a slight decrease in the bias.
Consequently if V, B, and M are the Variance, Bias and MSE
for m = 3, n = 8 (scheme 1) then the corresponding MSE for
m = 6, n = 8 (scheme 2) can be predicted by the expression:
MSE_2 = (B^2 + V/2) = (M + B^2)/2. (This is assuming x_1 is the
same for both schemes.) This expression overestimates the
MSE but at most by only a few percent.
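
The prediction rule above is simple arithmetic and can be written as a one-line helper. The sketch relies only on the decomposition MSE = variance + bias^2 and on the stated approximation that doubling m halves the variance while leaving the bias essentially unchanged; the function name and the numbers in the example are hypothetical.

```python
def predicted_mse_after_doubling_m(mse, bias):
    """Predict the MSE after doubling m, given the current MSE and bias.

    Uses MSE = V + B^2, halves V, and keeps B fixed, which gives
    B^2 + V/2 = (MSE + B^2) / 2.
    """
    variance = mse - bias ** 2
    return bias ** 2 + variance / 2.0


# Hypothetical scheme 1 values: MSE = 1.00 with bias 0.6 (so V = 0.64).
print(predicted_mse_after_doubling_m(1.00, 0.6))
```

When the bias dominates the MSE, the predicted gain from doubling m is small, which agrees with the general advice to prefer more levels over more samples per level.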

For further results and comparisons of methods utilizing
small sample theory see Cochran and Davis [Ref. 17], Davis
[Ref. 22], Wetherill [Ref. 130], and Odell [Ref. 87].


D. ESTIMATION OF EXTREME QUANTILES

For estimates of quantiles near the mid-region of a
response curve the Robbins-Monro method appears to
perform quite well. In fact for estimation of the θ_50
quantile both Wetherill [Ref. 130] and Davis [Ref. 22]
showed that sample sizes as small as 35 produced results
which were in good general agreement with asymptotic theory.
However in areas away from the neighborhood of θ_50
small sample estimates frequently have large biases and
have variances greatly in excess of theoretical predictions.
This behavior was also noted by Stillings and Logan [Ref. 108].
To try to explain this phenomenon Wetherill [Ref. 131]
presented the following example.

Suppose an experimenter wishes to estimate θ_90 and
that his initial level, x_1, is close to the true value.
Suppose that the first observation is zero, a failure
(as it will be about once every ten trials), then the
second observation will be taken at the level

     x_2 = x_1 - c(0 - 0.90) = x_1 + .90c.

This value, x_2, may well be far above θ_90. Assume that
the next two values will be positive (a success). This
leads to

     x_3 = x_2 - (c/2)(1 - .90) = x_1 + .85c

and

     x_4 = x_3 - (c/3)(1 - .90) = x_1 + .817c.


As can be easily observed the level of testing is very
slowly returning to the vicinity of θ_90. A total

of about a observations are necessary to pass below Xy > 
Methods utilizing accelerated stochastic approximation tend
to minimize this effect but the most interesting treatment
of this area thus far has been done by Goodman, Lewis, and
Robbins [Ref. 58]. Here a "maximum transformation" is
employed by taking multiple samples at a level. If it is
desired to estimate F(θ) = .99, where F(x) is a cumulative
distribution function, then V samples are taken at each level s.
Here V is the solution of the equation

     (0.99)^V = 0.50.

Then let

     P{max(Y_1, ..., Y_V) ≤ s} = [F(s)]^V.

In this case the solution for V is V = 69, and 69 samples
would be taken at each iteration. Thus the problem has
been transformed into estimating the θ_50 level where the
properties of the Robbins-Monro process are known to work
well.

Imewens Ref. 135] followed this same "maximum trans- 
formation" technique and then applied variance reduction 
and jackknifing to improve the rate of convergence


and to reduce bias. 
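
Both numerical points in this section can be reproduced with a few lines of arithmetic. The sketch below makes two assumptions that go beyond the text: for the slow-return example it supposes that every observation after the initial failure is a success (the text assumes this only for the next two), and for the maximum transformation it rounds the solution of p^V = 0.5 up to the next integer. Function names are hypothetical.

```python
import math


def samples_per_level(p_target, p_mid=0.5):
    """Smallest integer V with p_target**V <= p_mid (V = 69 for p = 0.99)."""
    return math.ceil(math.log(p_mid) / math.log(p_target))


def observations_until_return(p_target=0.90):
    """Count observations until the test level falls back to x_1 or below.

    Observation 1 is a failure, moving the level up by p_target * c.
    Each later observation n is assumed to be a success, moving the level
    down by (c / n) * (1 - p_target).
    """
    offset = p_target          # (x_2 - x_1) / c after the initial failure
    n = 1
    while offset > 0.0:
        n += 1
        offset -= (1.0 - p_target) / n
    return n


print("V for the 0.99 quantile:", samples_per_level(0.99))
print("observations needed to return:", observations_until_return(0.90))
```

Because the downward steps shrink like 1/n, the total distance recovered grows only logarithmically, which is why a single early failure is so costly when estimating an extreme quantile.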


E. THE CASE WHERE M(x) STOPS BEING A CONSTANT

Consider a response function where there is no reaction
until θ. (I.e. M(x) = 0 for x ≤ θ and M(x) > 0 for x > θ.)

[Figure: sketch of M(x) against x]

Sometimes one is interested in the level, θ, when response
first occurs (see Guttman and Guttman [Ref. 60]). Friedman
[Ref. 51] proved the following theorem.


THEOREM [Ref. 51] 


Let the following conditions be satisfied:


Ge GE E K; 

(11) 0 (x) < 0% < +0; 

Glin hensiile) = 0, 
wo xk > Osuekens Mes 


Gv) “for every 0< 6 imf ir > 0. 
| é<|x-6| 


Then choose fat, ta, ) such that 


and define the relation 


10 


un hon y Vy) 
Then x_n converges to θ with probability 1 and in mean square.

This theorem says that one can use stochastic
approximation to find that point at which the regression function
stops being a constant if the value of this constant is
known. If one does not know the value of the constant,
Friedman has proved another theorem which imposes sharper
conditions on M(x), for which x_n does converge to the
desired value θ.


F. BOTH VARIABLES SUBJECT TO ERROR

In the usual Robbins-Monro procedure it is assumed that
the regression function, M(x_n), is subject to a random
error term, say v_n. One might ask under what conditions will
the process converge if there exists a random error
component, say u_n, in the level setting of x_n as well, since it
is not always possible to precisely measure or set the
desired amount. Dupac and Kral [Ref. 35] discussed two such
cases. In the first case the error in setting the level is
assumed to be unaffected by the experimenter. In the second
case it is assumed that the error in the x level can be made
arbitrarily small for an inversely proportional price. In
the first case of "irreducible errors" Dupac and Kral
proved the following theorem.
THEOREM [Ref. 35]


Assume that the following conditions are satisfied:


(i) M(x) is odd with respect to θ,
     i.e. M(θ + x) = -M(θ - x) for all x;
(ii) M is strictly increasing;

Grrr) M(x.) -M(x, ) | ar as for all x 


1! iR 22 
(ivy) U is a symmetric random variable for each 

y E A ee); 
(v) Var U, < Cz for all x; 


NE) ror all x. 


Then the Robbins-Monro procedure defined by

     x_{n+1} = x_n - a_n [M(x_n + u_n) + v_n]

converges to θ with probability 1 as well as in mean square.

In the second case of Dupac and Kral, where one can
decrease the x setting errors, u_n, by an inversely
proportional price, they proved what intuition would tell us was
correct. They showed that it is needless to pay for high
precision at the starting steps; the precision should be
increased in the course of the approximation process.
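
The "irreducible errors" case can be simulated directly: the experimenter intends to test at x_n but the level actually applied is x_n + u_n, and the response carries its own error v_n. The sketch below is illustrative only. It assumes a particular response that is odd about θ, strictly increasing and Lipschitz (an arctangent), symmetric Gaussian setting errors, and a_n = c/n with c = 2; these choices and all names are hypothetical.

```python
import math
import random


def rm_with_setting_errors(theta=1.0, x1=4.0, n_steps=5000, c=2.0,
                           sigma_u=0.5, sigma_v=0.5, seed=0):
    """Robbins-Monro with error in both the level setting and the response."""
    rng = random.Random(seed)
    x = x1
    for n in range(1, n_steps + 1):
        u = rng.gauss(0.0, sigma_u)      # symmetric error in the level setting
        v = rng.gauss(0.0, sigma_v)      # error in the observed response
        # Assumed M(x) = arctan(x - theta): odd about theta and increasing.
        y = math.atan(x + u - theta) + v
        x -= (c / n) * y
    return x


print("estimate of theta:", rm_with_setting_errors())
```

The symmetry of both M and the setting error is what keeps the averaged response equal to zero at θ; with an asymmetric response the setting error would shift the point to which the procedure converges.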


G. THE CASE OF α UNKNOWN

Consider the following scenario: Suppose a scientist
is comparing two drugs, a test drug and a control drug.
He is interested in designing a biological assay to estimate
the number of dose units of the test drug necessary to
elicit the same mean response as the standard dose of the
control drug. Suppose further that the experimenter knows
little about the shape of the response function associated
with the test drug and about the probability distribution
of response at any one dose level of either drug.

Make the following notational identifications: Let an
observed response to the control drug administered at the
standard dose level correspond to the random variable, Z,
with mean α. Let the observed response to the test drug
at dose level x correspond to Y(x) with mean, M(x). Let
θ be the unknown dose level of the test drug such that
M(θ) = α. Then under weak conditions on M(x), and the
distributions of Y(x) and Z, the process defined by

     x_{n+1} = x_n - a_n [Y(x_n) - Z_n]

satisfies all known properties of the original Robbins-
Monro procedure. It seems, as was noted by Hamilton [Ref. 62],
that this procedure does not use all available information
at each step. Since the mean of Z_1, ..., Z_n is a better
estimator of α than just Z_n, one might expect a smaller mean
squared error from the sequential estimate of α, especially in
cases of small sample sizes. To analyze this Hamilton
compared two processes.
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Proceso 1 takes multiple observations at each step 
and computes an estimated value of a based only on 
the observations taken at that step (possibly 


only one). 


Process 2 takes the same number of multiple obser- 
Dar lonssetseaen step But computes the estimate of 
o based on all of the observations from the 


Peon OL Lie process . 


Hamilton, however, showed that under certain conditions it
is better, in magnitude of mean squared error, to take
the most recent control observations (process 1) rather
than taking sequential steps toward the mean of the control
observation. This result, based on large sample theory,
remains true in a simplified (linear) small sample situation.
VI. APPLICATIONS


In this section several applications of stochastic
approximation to a variety of fields will be presented.
The first example is an application to a problem in
biological research by Guttman and Guttman [Ref. 60]. It is
initially presented since it is a simple, straightforward
problem of the type for which the Robbins-Monro method was
conceived (also see Hawkins [Ref. 63]). This straightforward
use of the R-M method is also applicable to industrial
process control as discussed by Comer [Refs. 18 and 19]
where a lag in process response is incorporated into
the formulation.

However, more practical use of stochastic approximation
is based on the concepts of maximization or minimization of
functions. Many problems which can be analytically solved
if the response format is known fall nicely into the
stochastic approximation framework since answers do not depend
on the assumed parameterization. Also many problems based
on a criterion, such as minimizing expected squared error,
can be computationally very difficult to solve, as the
solutions may require matrix inversions, as in the
multidimensional case. Many problems of this type (see Saridis,
Nikolic and Fu [Ref. 99]) fall into the stochastic
approximation framework and yield computationally simple
algorithms which require very little storage space when
performed on a digital computer.
In a recent book edited by Mendel and Fu [Ref. 83] a
chapter has been devoted to applications of stochastic 
approximation methods. Also Tsypkin [Ref. 112] has nicely 
reviewed the important applicability of the Robbins-Monro 
process and related stochastic approximation methods to 
problems concerning pattern recognition, adaptive filters,
adaptive automatic control systems, and adaptation in
operations research and reliability theory. Some of the additional
papers which have considered these latter types of application
of stochastic approximation are by Aizerman et al. [Ref. 1],
Ernst [Ref. 38], Kailath and Schalkwijk [Ref. 67], Lee
[Ref. 80], Sakrison [Refs. 93, 94], Sklansky [Ref. 106],
Tsypkin [Ref. 111] and Ulrich [Ref. 116].


A. APPLICATION TO A PROBLEM IN BIOLOGICAL RESEARCH 

Guttman and Guttman [Ref. 60] desired to treat Paramecium
caudatum cells with a substance, Kinetin, which would
stimulate cell division, and to estimate the time at which
a certain level of this cell division was attained. They
stipulated that the ratio of the number of daily cell
divisions of treated paramecia to untreated paramecia (K/C)
is a monotone increasing function of time of exposure to
Kinetin. Guttman and Guttman stated that they had no idea
of the underlying probability distribution concerning the
ratio, K/C, thereby making stochastic approximation a very
convenient scheme. A Robbins-Monro scheme was formulated
to estimate the time at which K/C = 1.10. The initial
guess of x_1 = 30 hours was chosen with the expectation
that the desired value of θ was somewhere below this.
The sequence {a_n} was chosen as 20/n to allow for
large corrections in the first few steps and smaller
corrections thereafter. The stochastic approximation
sequence, as formulated for this problem, then looked like

     x_{n+1} = x_n - (20/n)(y_n - 1.10)

where y_n = the observed response ratio at time x_n. Guttman
and Guttman's table of observations, y_n, and computed next
levels, x_{n+1}, is reproduced in Table 2.

The experiment was terminated at n = 13 as no appreciable
differences appeared among the x_n from trial 6 onward.
Note that the mean value of the observations from n = 6
onwards is in fact equal to 1.10.
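
A Robbins-Monro run of this kind, with initial level 30 hours, target
ratio 1.10 and step sequence 20/n, can be imitated by the short sketch
below. It is an illustration only: the response curve simulated_ratio
(linear in exposure time, equal to 1.10 at 24 hours, with normal noise of
standard deviation 0.02) is a hypothetical stand-in and is NOT the
Guttman and Guttman data, and the function names are likewise assumed.

```python
import random


def robbins_monro(observe, x_start, target, n_steps, c=20.0):
    """Iterate x_{n+1} = x_n - (c / n) * (y_n - target); return all levels."""
    levels = [x_start]
    for n in range(1, n_steps + 1):
        y = observe(levels[-1])
        levels.append(levels[-1] - (c / n) * (y - target))
    return levels


rng = random.Random(0)


def simulated_ratio(hours):
    # Hypothetical response curve, not the experimental data: the mean
    # ratio rises linearly with exposure time and equals 1.10 at 24 hours.
    return 1.10 + 0.05 * (hours - 24.0) + rng.gauss(0.0, 0.02)


# Thirteen trials, as in the experiment described above.
levels = robbins_monro(simulated_ratio, x_start=30.0, target=1.10, n_steps=13)

# A much longer run, to show where the sequence settles under this model.
long_run = robbins_monro(simulated_ratio, x_start=30.0, target=1.10,
                         n_steps=2000)
```

With this assumed curve the levels move from 30 hours toward 24 hours,
taking large corrections in the first steps and small ones thereafter.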


TABLE 2

Stochastic Approximation of Hours of
Treatment Required with 1.5 mg/l
Kinetin to Produce an Expected Ratio


B. AN APPLICATION TO TAILORED TESTING

Suppose an educator or psychologist desires to measure
some mental trait of an individual. For instance suppose
it is desired to measure the level of difficulty of questions,
such that the individual will get, say α = 70% of them
correct. Suppose further that the educator has a bag full
of questions, each assigned a level of difficulty, B_i, such
that the probability that an individual, whose true ability
is at level i, will correctly answer a question of difficulty
B_i is α = .70. This is similarly written

     P_i(B_i) = α.


This idea was presented by Lord [Ref. 82] who proposed
a computer controlled testing scheme where questions of
difficulty B_n would be recursively selected by the scheme

     B_{n+1} = B_n + a_n {Y(B_n) - α},

where Y(B_n) is 1 if the question is answered correctly and
0 otherwise. Thus the scheme would eventually converge to the
individual's true ability, provided that the assumptions
were correct.
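
A recursive item-selection scheme of this type can be sketched as
follows. This is a minimal illustration and not Lord's program: the
logistic response model, the examinee ability of 1.0, the step sequence
10/n and the function names are all assumptions introduced for the
example. A correct answer raises the next difficulty and an incorrect
answer lowers it.

```python
import math
import random


def tailored_test(answers_correctly, n_items, target=0.70, b_start=0.0,
                  c=10.0):
    """Select difficulties by b_{n+1} = b_n + (c / n) * (y_n - target).

    y_n is 1 for a correct answer and 0 otherwise.
    """
    b = b_start
    for n in range(1, n_items + 1):
        y = 1.0 if answers_correctly(b) else 0.0
        b = b + (c / n) * (y - target)
    return b


rng = random.Random(2)
ABILITY = 1.0  # hypothetical examinee


def examinee(difficulty):
    # Assumed logistic model: P(correct) = 1 / (1 + exp(difficulty - ability)).
    p_correct = 1.0 / (1.0 + math.exp(difficulty - ABILITY))
    return rng.random() < p_correct


estimate = tailored_test(examinee, n_items=5000)

# Difficulty at which the assumed model gives a 70% chance of success.
expected = ABILITY - math.log(0.70 / 0.30)
```

Under the assumed model the selected difficulty should settle near the
level at which the examinee answers 70% of the questions correctly.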


C. UPGRADING OF INERTIAL NAVIGATION SYSTEMS

Consider a navigational platform with several high grade
gyros required for motion sensing. Bernard Lee [Ref. 80]
suggested replacing all but one gyro with a lower grade, less
expensive gyro. A supervisory system based on a continuous
Kiefer-Wolfowitz stochastic approximation algorithm similar
to that developed by Sakrison [Ref. 90] is then used to
estimate the drift rate of each of the low grade gyros and
apply a corrective signal. This concept permits each
substandard gyro to acquire a precision approaching that
of the higher grade gyro.


D. APPROXIMATION OF DISTRIBUTION AND DENSITY FUNCTIONS

Consider the distribution F(x) = Prob [Y < x] where
Y is a scalar random variable. The problem is to find an
approximation to F(·) by a linear combination of a previously
chosen vector of functions, φ^T(x) = (f_1(x), f_2(x), ..., f_n(x)),
where the superscript T denotes the transpose of the column
vector φ(x). Thus we desire to find a column vector of
coefficients, C, such that our approximation

     \hat{F}(x) = C^T φ(x)

minimizes some criterion such as expected squared
error in a region of interest (a,b). Denote the mean square
error as

     J(C) = \int_a^b {F(x) - C^T φ(x)}^2 dx.

Now minimizing J(C) is equivalent to solving the matrix
equation

     -(1/2) ∂J(C)/∂C = \int_a^b F(x) φ(x) dx - \int_a^b φ(x) φ^T(x) C dx = 0

or

     \int_a^b F(x) φ(x) dx - K C = 0,

where

     K = \int_a^b φ(x) φ^T(x) dx

is an n x n matrix.


Now define a random function Z(y, x) such that

     Z(y, x) = 1   if y < x
             = 0   if y ≥ x,

and such that

     E{Z(Y, x)} = Prob [Y < x] = F(x).

Thus the regression matrix equation

     E{ \int_a^b Z(Y, x) φ(x) dx - K C } = 0

is equivalent to our previous equations for finding the
minimum of the criterion, J(C). But this can now be solved
by a stochastic approximation algorithm if successive
independent samples of the random variable, Y, are available.


The algorithm can be written as

     C(j+1) = C(j) + a_j [B(y(j)) - K C(j)]

where we define

     B(y(j)) = \int_a^b Z(y(j), x) φ(x) dx

             = \int_a^b φ(x) dx         if y(j) < a
             = \int_{y(j)}^b φ(x) dx    if a ≤ y(j) < b
             = 0                        if y(j) ≥ b,

where y(j) is the jth sample from the distribution and
the sequence {a_j} satisfies

     \sum_{j=1}^{\infty} a_j = \infty   and   \sum_{j=1}^{\infty} a_j^2 < \infty.


Thus the above algorithm now fits the format of
multidimensional stochastic approximation. In particular, if
the matrix K is positive definite, it satisfies the conditions
of a theorem by Blum [Ref. 7, theorem 2].

Then the sequence C(j) converges with probability 1 to
the value which minimizes J(C). This value can be written
as

     C* = K^{-1} \int_a^b F(x) φ(x) dx,

which requires inversion of an n x n matrix to solve directly.
Therefore the above algorithm enables one to find a minimum
mean square error approximation to a distribution function
for which the only available information is the collection
of sample values randomly selected. This algorithm is from
a paper by Blaydon [Ref. 4] and can be similarly extended to
approximate density functions.
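
The recursion C(j+1) = C(j) + a_j [B(y(j)) - K C(j)] can be tried on a
toy problem, as in the sketch below. Everything specific in it is an
assumption made for illustration and is not taken from Blaydon's paper:
the basis φ(x) = (1, x), the interval (0, 1), samples uniform on (0, 1)
so that F(x) = x, and the step sequence a_j = 20/(j + 30), which satisfies
both summation conditions. For this case the least squares solution is
C = (0, 1), which gives a direct check without inverting K.

```python
import random

LOWER, UPPER = 0.0, 1.0  # region of interest (a, b), assumed for this sketch

# K = integral over (a, b) of phi(x) phi(x)^T dx for phi(x) = (1, x),
# worked out by hand for the interval (0, 1).
K = [[1.0, 1.0 / 2.0],
     [1.0 / 2.0, 1.0 / 3.0]]


def phi_integral(lo, hi):
    """Integral of phi(x) = (1, x) from lo to hi."""
    return [hi - lo, (hi * hi - lo * lo) / 2.0]


def b_vector(y):
    """Integral over (a, b) of Z(y, x) phi(x) dx, where Z = 1 when y < x."""
    if y < LOWER:
        return phi_integral(LOWER, UPPER)
    if y < UPPER:
        return phi_integral(y, UPPER)
    return [0.0, 0.0]


def fit_distribution(samples, c=20.0, n0=30.0):
    """Iterate C(j+1) = C(j) + a_j (B(y_j) - K C(j)), a_j = c / (j + n0)."""
    coef = [0.0, 0.0]  # start from the zero vector
    for j, y in enumerate(samples, start=1):
        a_j = c / (j + n0)
        b = b_vector(y)
        kc = [K[0][0] * coef[0] + K[0][1] * coef[1],
              K[1][0] * coef[0] + K[1][1] * coef[1]]
        coef = [coef[i] + a_j * (b[i] - kc[i]) for i in range(2)]
    return coef


rng = random.Random(0)
samples = [rng.random() for _ in range(20000)]  # uniform on (0, 1)
coef = fit_distribution(samples)
```

Only the current coefficient vector is stored between samples, which is
the storage advantage noted for algorithms of this kind.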

A refinement of this algorithm by Blaydon was presented
by Deuser and Lainiotis [Ref. 27]. The refinement incorporates
a double stochastic approximation algorithm to recursively
generate a matrix from each independent observation and
then to recursively generate the estimate of the coefficient
vector using the previously generated matrix as an
observation. Deuser and Lainiotis presented the example where the


unknown probability distribution is F(x) = 1 - e^{-x} for x ≥ 0.
The approximating function, \hat{F}(x), is to be a weighted
sum of the first three Laguerre polynomials


1 
o(x) = er , E 
ee 


The initial choice of the coefficient vector is the
zero vector. It can be shown analytically that the optimal
coefficients are:
O ~. 186 -.239) 


In a computer simulation using 1000 samples and using
the step sequence a_n = 1/n, Deuser and Lainiotis obtained
estimates for the coefficients which, on the average, did
not differ by more than .01 in absolute value from the
optimal coefficients.


E. AN APPROACH TO PATTERN RECOGNITION USING STOCHASTIC
   APPROXIMATION TO MINIMIZE RISK
Consider a mixture from two samples where an observation
which is drawn at random is of type 1 with unknown
probability p, and is then of type 2 with probability 1 - p.
It is desired to measure some quality of the samples, call
it x, and apply a decision rule say

     d(x,α) = 1   if x ≤ α
            = 2   if x > α,

where α = θ is some unknown value which minimizes a risk
function, R(d(x,α)), which we have chosen. Since the choice
of α completely specifies the decision rule and risk function,
denote them d(α) and R(α).

Now R(α) can be viewed as a regression function. By
this it is meant that there exists a random variable, Y,
with conditional probability distribution function F(y|α)
such that

     R(α) = \int y dF(y|α).

The random variable, Y, is defined as follows:
Let Y (given α) = C_ij for an observation which
actually is of type i and is classified
by d(x,α) as type j.
In general C_ii = 0.


Then a simple one dimensional stochastic approximation
scheme can demonstrate the solution process. Consider a
test sample where it is not known of which population each
item is a member. Then define the scheme

     α_{n+1} = α_n - (a_n / c_n)(y_{2n} - y_{2n-1}),

where y_{2n-1} = C_ij if sample z_{2n-1} is actually of type i
and d(z_{2n-1}, α_n - c_n) = type j, and y_{2n} = C_ij if
sample z_{2n} is actually of type i and
d(z_{2n}, α_n + c_n) = type j, where α_1 is chosen
arbitrarily and the conditions

     c_n → 0,   \sum_{n=1}^{\infty} a_n = \infty,

     \sum_{n=1}^{\infty} a_n c_n < \infty   and   \sum_{n=1}^{\infty} (a_n / c_n)^2 < \infty

are satisfied.


Furthermore the risk function must satisfy

     \inf_{1/k < α-θ < k} \underline{D} R(α) > 0

and

     \sup_{1/k < θ-α < k} \overline{D} R(α) < 0   for all integers k,

where \overline{D} R(α) = the limit superior of

     [R(α+h) - R(α)] / h

as h tends to 0, and \underline{D} R(α) = the limit inferior of

     [R(α+h) - R(α)] / h

as h tends to 0.


Note that R(α) does not have to be differentiable at all α.
Then if the above conditions are satisfied, α_n converges
in probability to θ and

     lim_{n → ∞} E{(α_n - θ)^2} = 0.

Then the decision rule which minimizes the risk function
is

     d(x,θ) = 1   if x ≤ θ
            = 2   if x > θ.


The above one dimensional scheme was presented by Cooper
[Ref. 20] who stated that the application to a K-dimensional
scheme including noise could be performed by modifying the
above procedure to the multidimensional case of Blum [Ref. 7].
It is noted that the above example falls into the framework
of Bayesian learning and decision rules. An excellent
paper by Chien and Fu [Ref. 13] discusses Bayesian related
learning procedures which can be shown to be a special case
of stochastic approximation algorithms and hence can be
carried out in computationally simple schemes as the one
just presented.
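
A threshold search of the Kiefer-Wolfowitz type for this classification
problem can be sketched as follows. This is an illustration under
assumptions and is not Cooper's implementation: the two populations
(normal with means 0 and 2 and unit variance, mixed with p = 0.5), the
losses C_12 = C_21 = 1, the sequences a_n = 4/(n + 20) and c_n = n^(-1/6),
and all names are hypothetical. With these choices the risk is the
probability of misclassification, which is minimized at the threshold 1.

```python
import random


def classify(x, alpha):
    """Decision rule: type 1 if x <= alpha, otherwise type 2."""
    return 1 if x <= alpha else 2


def learn_threshold(draw_labelled, loss, n_steps, alpha_start=0.0,
                    a=4.0, n0=20.0, c=1.0):
    """Search for the threshold that minimizes the expected loss.

    Each step classifies one labelled sample with threshold alpha - c_n and
    a second with alpha + c_n, then moves alpha against the observed
    difference: alpha <- alpha - (a_n / c_n) * (y_plus - y_minus).
    """
    alpha = alpha_start
    for n in range(1, n_steps + 1):
        a_n = a / (n + n0)
        c_n = c / n ** (1.0 / 6.0)
        kind_1, x_1 = draw_labelled()
        kind_2, x_2 = draw_labelled()
        y_minus = loss[kind_1][classify(x_1, alpha - c_n)]
        y_plus = loss[kind_2][classify(x_2, alpha + c_n)]
        alpha -= (a_n / c_n) * (y_plus - y_minus)
    return alpha


rng = random.Random(3)


def draw_labelled():
    # Hypothetical mixture: type 1 ~ N(0, 1), type 2 ~ N(2, 1), p = 0.5.
    if rng.random() < 0.5:
        return 1, rng.gauss(0.0, 1.0)
    return 2, rng.gauss(2.0, 1.0)


# LOSS[i][j] = C_ij, the loss when a type i item is called type j.
LOSS = {1: {1: 0.0, 2: 1.0}, 2: {1: 1.0, 2: 0.0}}

theta_hat = learn_threshold(draw_labelled, LOSS, n_steps=20000)
```

Note that the sketch needs the true type of each training item in order
to score the loss, and that the risk function itself is never evaluated;
only individual losses are observed.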
VII. AREAS FOR FURTHER STUDY


This section is devoted to stating particular areas
where further work may be of interest. These ideas have
been noted as either not being discussed in the current
literature or as having been analyzed when required
conditions were not satisfied.


A. DEVIATIONS FROM THE LINEAR CASE

Section III.G discussed the estimate of expected squared
error for the linear case and mentioned that other than a
sampling experiment by Teichroew, very little analysis had
been done. Work needs to be done in this area to determine
amounts of departures from linearity where linear results
remain valid.


B. STOPPING RULES

Stopping rules not based on bounded confidence intervals
utilizing asymptotic normality are almost nonexistent. What
is needed is some nonparametric stopping rule based on say,
number of changes of sign of (x_n - x_{n+1}). Many authors have
noted that this is a virtually untouched area yet almost
nothing has appeared in the literature.


C. POSSIBLE WEAKENING OF CONDITIONS ON a_n

In Comer's paper "Application of Stochastic Approximation
to Process Control" [Ref. 19], an error in the formulation
of a Robbins-Monro process yields interesting results.
Comer mistakenly used the step sequence a_n = 1/(n)*? in
a simulation comparison. Note that this sequence does not
satisfy the requirement \sum_{n=1}^{\infty} a_n^2 < \infty. However his
results, when compared with the results using a sequence
which does satisfy the necessary requirements, show that
the sequence a_n = 1/n*? gives comparable if not superior
results. The idea to explore is whether (1) this is due to
Comer's simulation error, or (2) the conditions on a_n can
actually be weakened in practice to obtain more desirable
results.
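
As a quick check on which power-law step sequences meet both
Robbins-Monro requirements, the standard p-series test can be applied to
a_n = 1/n^p. The exponent p here is a generic symbol introduced for this
note and is not a value taken from Comer's paper.

```latex
a_n = n^{-p}: \qquad
\sum_{n=1}^{\infty} a_n = \infty \iff p \le 1, \qquad
\sum_{n=1}^{\infty} a_n^2 < \infty \iff p > \tfrac{1}{2}.
```

Both conditions therefore hold exactly when 1/2 < p ≤ 1. Any p ≤ 1/2
violates the square-summability requirement, and any p > 1 violates the
requirement that the steps sum to infinity.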


D. REPEAT SIMULATION OF THE KIEFER-WOLFOWITZ PROCESS

In a previous simulation comparison study of
Kiefer-Wolfowitz type methods, Springer [Ref. 107] used as a
sequence of norming constants the sequence where a_n = Any?‘
He discussed the result of finding a small sample bias which,
one should note, can be attributed to the fact that this
sequence does not satisfy the assumption that
\sum_{n=1}^{\infty} a_n = \infty. Perhaps a new simulation study
using proper coefficients is in order.


E. MULTIDIMENSIONAL EXTENSION OF DUPAC AND KRAL'S RESULTS

Dupac and Kral [Ref. 35] (see Sec. V.F) examined the
Robbins-Monro one dimensional case where there are errors in
setting the x-level. They cited conditions where x_n → θ
when these errors exist. They noted that errors of this
type make the Kiefer-Wolfowitz procedure practically
inapplicable to this type of analysis, but speculated that
a generalization to the multidimensional case might be of
interest.